Abstract

This thesis focuses on understanding the cost of transparent interprocess communication in a distributed
system consisting of a set of machines connected by a local network. Interprocess communication is
transparent if processes can communicate without regard to physical host boundaries. Transparent
interprocess communication is a very powerful tool because it allows us to view the collection of different
machines as a single, logically unified computer system. We concentrate on the efficiency aspects of
transparent interprocess communication on a local network.

In order to obtain experimental evidence, a transparent message-passing mechanism has been implemented as
part of the distributed V kernel. This message-passing mechanism has been used as the basis for various
distributed applications. In particular, it has been used extensively for providing transparent file access from
diskless workstations to a set of network-based file servers. Based in part on experience gained from the
implementation and use of the distributed V kernel, this thesis presents four contributions:

1. An empirical evaluation of high-performance message passing on a local network.

2. A queueing network model of file access from diskless workstations over a local area network to a set of
file servers.

3. An analysis of the protocol used to support the V interprocess communication on a broadcast network.

4. The integration of the broadcast and multicast capabilities of local area networks into message-based
interprocess communication.
— 1 —
Introduction 


1.1.  Introduction  to  the  Research  Area 

In recent years, many researchers have come to the belief that a large, single-machine timesharing system is
no longer the most effective way of providing computing services for a large class of users. Two technological
advances have stimulated this evolution:

1. Advances in VLSI technology have significantly reduced the cost of processors and memory, to the point
where it has become economically feasible to dedicate a whole processor with a large amount of
memory to a single user.

2. High-speed local networks have made it possible to interconnect such personal machines to expensive
peripherals such as printers and file servers, so that the high cost of these peripherals can be amortized
over a large number of users.

As a result of these technological advances, personal computer-based systems have become popular: each user
is given a dedicated personal computer which is connected by a local network to specialized server machines
like file servers, printers, high-speed computing engines, etc. These new systems have relieved several of the
problems inherent to large single-machine timesharing systems:

1. Performance is less dependent on the number of users of the system. Some contention remains for
access to the shared server machines, but its effects are typically less pronounced than the performance
degradation experienced when a comparable workload is presented to a single-machine timesharing
system.

2. The system is easily upgradable in terms of computing power by simply adding one or more extra
processors.

3. The dedication of a whole machine to a single user has provided enough spare cycles to supply the user
with a high-quality user interface.

However, these systems require explicit action to be taken when access to a remote machine is desired. For
example, a file on a remote file server cannot be accessed by the same procedure as the one used to access a
local file. Instead, a file transfer program has to be invoked to transfer the file from the file server to the local
machine, before normal file operations can be performed. In general, access to resources is not transparent.
Resources on other machines (such as files or virtual terminals) can only be accessed by different mechanisms
than those used for accessing similar resources on the local machine.

Separating users onto different machines, with no transparent way to communicate between them, has
significantly reduced the amount of sharing that can be accomplished among users, both in the sense of
physical sharing of resources as well as in the psychological sense of sharing a single environment. In a
well-designed timesharing system, the degree of physical sharing is nearly complete: every processor cycle,
every byte of memory, every sector of the disk can in principle be used by any user. In a personal computer
environment, a user is essentially limited to the resources of his own workstation. Access to resources on
other machines is either too awkward or too costly to be worthwhile. Additionally, timesharing systems
provide a strong sense of community among their users, particularly because of the ease with which files can
be shared between different users.

Lack of transparency, and the resulting loss of sharing, have made it difficult to provide personal computer
users with a number of services that are commonplace in good timesharing systems. Some specific examples:
1. In a timesharing system, the executable images of system programs are stored on disk exactly once, and
this single copy is shared by all users. If, as is common in the personal computer model, programs can
only be loaded from the local disk, then system programs need to be replicated on all workstations. This
results in a serious loss of (physical) disk sharing. More importantly, replication of system programs in
such an uncontrolled fashion tends to cause inconsistent versions of these programs, slow propagation
of updates, etc.

2. It is often the case that the code segment of certain system programs can be shared between different
invocations of the program, with every invocation having its private data segment. This is obviously not
possible between personal computers.

3. In a centralized timesharing system, file system backup is done for the users by a system operator. Since
it is often not possible to do file system backups across the network and most personal computers are
not outfitted with a tape drive, personal computer users must explicitly transfer their files to a remote
file server in order to have them backed up on a regular basis.

4. Finally, while a user has complete and unlimited access to the resources of his own machine, there is in
general no convenient way for him to make use of the computing power of other processors in the
system, even if they are idle. Centralized systems, in contrast, can typically make all of their facilities
available to a single user during off hours.

It is sometimes argued that these problems are purely technological in nature, in the sense that their
importance will decrease as processors, memory and disks become cheaper and more powerful. For instance,
it is argued that, as personal computers become powerful, there will be little need to be able to execute
programs on other machines. We see two flaws in this argument. First, it is to be expected that the aspirations
of users will become correspondingly larger and therefore will exceed the capabilities of any single machine,
however large. Second, while personal computers are good for computing that is personal in nature (such as
editing a private file, for instance), some applications inherently do not fit this model very well, in particular
those requiring frequent access to shared data. Supporting such applications in a personal computer
environment will remain difficult.

Notwithstanding these disadvantages, the personal computer approach has been shown to be workable. In
an environment where individual users regard each other with mutual suspicion, this approach is actually
quite appropriate because of the inherent autonomy and protection between users. However, the V-System
takes an alternative viewpoint. Rather than using the network as a loose connection between the machines
(much like a miniature long-haul network), we prefer to view the network as an extended backplane,
connecting workstations and servers in a multiprocessor arrangement¹. The V-System then allows the
different machines to be integrated into a highly parallel but logically unified computer system, attempting to
combine the advantages of personal computers with those of conventional timesharing systems.

This logical unification is achieved by making resource access transparent. This makes the collection of
machines appear as a single logical entity to its users. In V, transparent resource access is based on the use of a
transparent (message-based) interprocess communication mechanism that provides communication between
processes on the same or on different workstations. The rest of the system can then be implemented in
approximately the same way as the message-based systems developed for single machines, regardless of the
fact that some processes may reside on other machines.

Let us briefly review some of the drawbacks we mentioned about personal computers and see how this
approach either solves or at least alleviates some of the problems.

1. Since interprocess communication is transparent, programs can be loaded from a remote (central) file
server instead of from the local disk. By extension, no local permanent storage may be necessary at all,
thereby significantly reducing the cost of a workstation. System programs are now shared on a single
machine, thereby accomplishing some amount of disk sharing. Moreover, the replication problems
alluded to earlier are no longer present. Also, since all user files reside on a centralized file server, file
system backups can be done in the traditional way.

2. Some code sharing can occur between programs in the following, non-traditional way. Take again the
file system as an example. If every workstation has its own disk and consequently its own file system, no
code sharing occurs. However, if there is a single central file server and all workstations use this file
server for their secondary storage access, then they effectively share the file server code. Since this code
resides on the remote file server machine, this reduces the demands on the processor and the memory of
the workstation itself.

3. Programs can be loaded and executed on other workstations, in the same manner as on the local
machine, thereby giving users potentially more than one processor at their disposal.

¹ The term multiprocessor is not intended here to imply the use of shared memory.

A couple of comments are called for at this point. First, any communication mechanism that is transparent
across machine boundaries can accomplish the above goals, whether it is message-based, procedural, based on
shared memory or based on some other communication paradigm. Second, all of the above advantages are
rendered inconsequential if the communication mechanism does not perform adequately. We concur with
Popek et al. [59] that inefficient transparency is no transparency. Indeed, if the cost of remote operations is
substantially higher than the cost of the corresponding local operations, then the applications programmer has
again to distinguish between the two, thereby undoing most of the intended benefits of a transparent
mechanism. The V-System's interprocess communication performs sufficiently well to provide efficient
transparent access for most applications. The mechanisms for accomplishing such performance form the
principal topic of this thesis.


1.2.  Overview  of  the  V-System 

1.2.1.  Hardware  Environment 

The V-System runs on a set of SUN workstations [6] interconnected by a 3 Mb Ethernet [54] or a 10 Mb
Ethernet [27]. The SUN workstation is a 68000-based machine, typical of many workstations currently on the
market. It provides (See also Figure 1-1):

1. Approximately 1 MIPS of computing power

2. Up to 2 Megabytes of physical main memory

3. Support for separate virtual address spaces up to 2 Megabytes

4. 1000x1000 bitmap display, corresponding frame buffer with hardware assists for rasterop operations and
a mouse

5. 3 Mb Ethernet or 10 Mb Ethernet interface.

Due to the limitations of the 68000 processor, no demand paging is done. The system has been in daily use
in this configuration for almost two years now. At the time of writing, the V-System is being ported to the
next generation SUN, with a 68010 processor which does allow demand paging. At the same time,
implementations on the Vax 11/750 and the IRIS workstation [22] are under way.

Figure 1-1: Sun Workstation

1.2.2. Software Environment

The V-System software follows the conventional model of many message-based systems [14, 36, 81].
Logically, it consists of a distributed kernel and a set of server processes, possibly running on dedicated server
machines. The distributed kernel itself consists of the collection of kernels resident on each participating
machine (See Figure 1-2). The kernel is essentially a communications server: it provides communication
between processes and very little else. Many functions present in the kernel of other operating systems (file
services e.g.) are performed outside of the kernel in the V-System. Functions other than interprocess
communication which require kernel support (process, memory and device management) are provided by a
pseudo-process running inside the kernel. Through a low-overhead interkernel protocol the different kernels
cooperate to provide transparent communication between processes on different machines. Servers currently
include device servers, storage servers, virtual graphics terminal servers and network servers. Device servers
are mostly implemented as pseudo-processes in the kernel: in a typical configuration they allow access to the
network, the keyboard, the mouse and the graphics frame buffer. All other servers are implemented as
(collections of) processes outside of the kernel. The internet server |.W) permits its clients to communicate
with any host or program accessible via the Xerox Pup [11] or the ARPA Internet protocols (ftO). The virtual
graphics terminal server [37, 38] supports object-oriented graphics interaction in a device-independent
manner. The storage server provides access to a hierarchically structured file system.

The V kernel's interprocess communication has been modeled after that of Thoth [14, 16] and Verex [49],
with some influence from DEMOS [5]. Communication between processes is provided in the form of short
fixed-length messages, each with an associated reply message, plus a data transfer operation for moving larger
amounts of data between processes. The communication primitives promote a client-server style interaction.
The common communication scenario is as follows (See Figure 1-3): a client process executes a Send to a
server process, which then completes execution of a Receive to receive the message and eventually executes a
Reply to respond with a reply message back to the client. The receiver may execute one or more MoveTo or
MoveFrom data transfer operations between the time the message is received and the time the Reply message
is sent.
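
The Send, Receive, Reply exchange described above can be mimicked in a few lines of Python. This is only a single-machine sketch built on threads and queues, not the V kernel's implementation. The names `Process`, `send`, `receive`, `reply`, `move_from` and `move_to` are illustrative stand-ins for the kernel primitives, the 32-byte `MSG_SIZE` is an assumed value for the short fixed-length message, and the `segment` buffer is a hypothetical way of modeling the memory that a blocked client exposes to data transfer operations.

```python
import queue
import threading

MSG_SIZE = 32  # assumed length of a short fixed-length message, in bytes


class Process:
    """Toy stand-in for a process known to the kernel (hypothetical)."""

    def __init__(self, name, segment=b""):
        self.name = name
        self.inbox = queue.Queue()         # messages awaiting a Receive
        self.reply_box = queue.Queue()     # reply to the outstanding Send
        self.segment = bytearray(segment)  # memory exposed while blocked in Send


def send(sender, receiver, message):
    """Deliver a message and block the sender until the reply arrives."""
    assert len(message) == MSG_SIZE
    receiver.inbox.put((sender, message))
    return sender.reply_box.get()


def receive(receiver):
    """Block until a message arrives; return (sender, message)."""
    return receiver.inbox.get()


def reply(client, message):
    """Unblock the client with a fixed-length reply message."""
    assert len(message) == MSG_SIZE
    client.reply_box.put(message)


def move_from(client, offset, count):
    """Copy count bytes out of the blocked client's segment."""
    return bytes(client.segment[offset:offset + count])


def move_to(client, offset, data):
    """Copy data into the blocked client's segment."""
    client.segment[offset:offset + len(data)] = data


def pad(payload):
    """Pad a short payload with zero bytes up to the fixed message length."""
    return payload.ljust(MSG_SIZE, b"\0")


def upper_case_server(server):
    """Serve one request: upper-case the client's segment in place."""
    client, message = receive(server)
    length = message[0]                  # the request carries only the data length
    data = move_from(client, 0, length)  # bulk data moves outside the message
    move_to(client, 0, data.upper())
    reply(client, pad(b"ok"))


def demo():
    client = Process("client", b"hello v kernel")
    server = Process("server")
    worker = threading.Thread(target=upper_case_server, args=(server,))
    worker.start()
    answer = send(client, server, pad(bytes([len(client.segment)])))
    worker.join()
    return answer.rstrip(b"\0"), bytes(client.segment)
```

In this sketch the client remains blocked inside `send` until the server calls `reply`, which is what makes the exchange synchronous; the data transfer calls are only meaningful during that window.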

A set of standard library routines provides a procedural interface to the message passing primitives. Thus,
kernel operations such as Send, Receive, Reply, MoveTo and MoveFrom typically appear as procedure calls in
user code. The corresponding library routine then invokes the kernel via a trap mechanism. However, many
application programs prefer to handle their I/O connections as byte streams rather than having to deal with
the low level message interface. In V, a reliable block stream is provided by means of the V I/O protocol [13],
essentially a set of conventions on the format, the contents and the sequence of messages to be exchanged
between clients and servers. Additionally, there are two library packages of interest to this discussion: one
that implements a byte-stream interface to the V I/O protocol's block stream, and a second that provides
essentially the same interface as the Unix C library (See Figure 1-4). Clients typically use the latter library
packages, which then perform the appropriate translation into V messages. Servers use the Receive and Reply
kernel primitives directly to implement the I/O protocol. It is essential to recognize that the stream protocol
is implemented outside of the kernel. That is, the kernel only transmits the messages between processes,
irrespective of their contents or their sequence.

Figure 1-2: The V-System: Overview
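
The layering described above, in which a byte-stream library sits outside the kernel on top of a block stream, can be sketched as follows. This is an illustration under stated assumptions rather than the V I/O protocol itself: `ByteStream`, `make_block_reader` and the 4-byte `BLOCK_SIZE` are hypothetical, and the `read_block` callable stands in for what would be a message exchange with a server.

```python
BLOCK_SIZE = 4  # assumed block size, kept tiny so the example is easy to follow


def make_block_reader(data):
    """Return a block-read function over data, plus a log of requested blocks.

    In a real system each call would be a request message to a server;
    here it is a plain function call (an assumption of this sketch).
    """
    requested = []

    def read_block(index):
        requested.append(index)
        start = index * BLOCK_SIZE
        return data[start:start + BLOCK_SIZE]

    return read_block, requested


class ByteStream:
    """Byte-stream view layered on a block-at-a-time reader."""

    def __init__(self, read_block):
        self._read_block = read_block
        self._next_block = 0
        self._buffer = b""
        self._at_end = False

    def read(self, count):
        """Return up to count bytes, fetching whole blocks only as needed."""
        while len(self._buffer) < count and not self._at_end:
            block = self._read_block(self._next_block)
            self._next_block += 1
            if len(block) < BLOCK_SIZE:  # a short block marks the end of data
                self._at_end = True
            self._buffer += block
        result = self._buffer[:count]
        self._buffer = self._buffer[count:]
        return result
```

A caller sees only `read(count)`; how many block requests that takes, and what each block contains, stays invisible to whatever transports the blocks, in the same spirit as the kernel being unaware of stream contents.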


1.3.  Thesis  Outline 

The ideas of this dissertation are developed as follows. Chapter 2 presents a brief overview of related work.
Chapter 3 defines the communication primitives available at the kernel interface, and presents the results of
an empirical performance evaluation of the kernel. The performance evaluation covers the performance of
the kernel by itself (in terms of message passing times), the performance of file access when implemented on
top of the V kernel and the performance of other important interprocess communication patterns. Chapter
4 uses data from the performance measurements of Chapter 3 as input to a queueing network model of
network page-level file access. The kernel implementation and the interkernel protocol are described in
Chapter 5. Performance tradeoffs between various implementation strategies are brought forth and wherever
possible quantified. In Chapter 6 we extend the one-to-one interprocess communication of the V kernel to
one-to-many communication drawing on the multicast capabilities of many local networks. Finally, in
Chapter 7 we summarize the contributions of this thesis and explore avenues for further work.
— 2 —
Related  Work 


The purpose of this chapter is to provide the reader with a few representative examples of related efforts, so
he can put the work of this thesis in the appropriate context. It is not intended to give an extensive survey of
distributed systems research in the last decade. This overview is also restricted to systems that have been
implemented and are known to work. The chapter is organized as follows: in each section, a particular class
of systems is covered. Systems belonging to the class are mentioned and then a prominent example of the
class is singled out and used as a vehicle for further discussion. Following is a list of the classes discussed and
their flag bearer implementations:

1. Non-transparent distributed systems: Arpanet

2. Streams and specialized file protocols: LOCUS

3. Enhanced² message passing: Accent

4. Reduced message passing: Thoth

5. Remote procedure calls: Cedar RPC

6. Remote memory references: Spector

7. Virtual memory-oriented systems: Apollo DOMAIN

8. Object based systems: Eden

One other system that does not fit any particular class, the Cambridge Distributed System, is also briefly
discussed.


2.1.  Non-transparent  Distributed  Systems 

The Arpanet [53], the oldest example of a distributed system, is probably also the best example of a
non-transparent distributed system. Resources on other machines (such as files or virtual terminals) can only
be accessed by different mechanisms than those used for accessing similar resources on the local machine.
Although some attempts in the way of transparency have been made [33, 74], too many odds had to be
overcome. The long-haul nature of the network, the basic protocol structure, whereby hosts regard each other
with mutual suspicion [53], and the heterogeneity of the different machines and operating systems have
precluded significant progress in that direction.

Most commercial systems have carried this tradition forward, although they use high-bandwidth, low-delay
local network technology. Much of the early Alto software [73] is in this class and also some of the ensuing
work on the Xerox Star [31] system. Based on a desire to provide a large degree of autonomy to individual
nodes in the network, some of the work done at MIT has also shied away from the notion of
transparency [19, 21].

Due to hardware constraints or explicit policy decisions, these systems aim for a much more loosely coupled
environment than the V-System. Since they are so different both in scope and motivation, they are not
discussed further.

² We use the terms enhanced and reduced with respect to message passing in loose analogy to the way they
have become known for machine instruction sets.


2.2.  Streams  and  Specialized  File  Access  Protocols 

Perhaps a more catchy header for this section would be distributed Unix systems. Many such
systems [50, 62, 66] have been built or are being built, largely because of the popularity of Unix [65] as a
single-machine operating system. We use LOCUS [59, 75] as the representative system of this class.

In terms of hardware, the current LOCUS system is geared somewhat more towards a network connecting a
number of general purpose multi-user machines while the V-System adheres more to a model of single user
workstations and dedicated server machines. The LOCUS kernels on different machines cooperate to give the
illusion of system-wide transparency, similar to the distributed V kernel. The LOCUS interkernel protocol is
really a collection of protocols, each of which is specialized for a particular application (such as file access,
remote tasking, etc.). In contrast, the V interkernel protocol supports interprocess communication.
Applications then use interprocess communication as the base layer for building their application-dependent
protocols. Of particular interest in the LOCUS interkernel protocol is the specialized file access protocol
specifically optimized for high performance sequential file access. In order to further optimize file access, the
file system is part of the LOCUS kernel. Although different in the communication model it emphasizes, the
LOCUS interkernel protocol is similar to that of the V-System in stripping away protocol layers for increased
performance. LOCUS serves as a frequent point of comparison in this thesis, both in terms of the primitives it
supports as well as in terms of the system's performance.


2.3.  Enhanced  Message  Passing 

Systems in this class all have common ancestry in the single machine DEMOS [5] operating system,
developed at Los Alamos for the Cray-1. Among its distributed descendants are Arachne [70] at the
University of Wisconsin, SODS [68, 69] at the University of Delaware, DEMOS/MP [61] at Berkeley and
Accent [63] at CMU. The latter was developed drawing heavily on the experience gained during the
development of the RIG [36] operating system at the University of Rochester and also incorporates several
ideas from the Multics file system [26, 58]. It is taken as the representative of this group.

All these systems adhere to the same basic software structuring paradigm as the V-System in that they
consist of a communications kernel and a set of server processes that provide other system services. However,
the communications interface that these kernels provide is rather sophisticated. Accent, in particular,
supports protected message paths, indirect process addresses (ports), asynchronous message communication,
arbitrary size messages and structured message formats. The value of providing all these features at the kernel
interface has to be weighed against potential penalties in complexity and performance.


2.4.  Reduced  Message  Passing 

The seminal system in this class is the Thoth kernel [14, 16]. The goals of the Thoth project, done at the
University of Waterloo, were two-fold: to experiment with the multi-process structuring of software and to
gain insight in system software portability. Thoth introduced many of the concepts present in V, in particular
the idea of synchronous message passing. The Thoth kernel was implemented on a number of machines,
including a Nova/2 and a TI-990. Verex [49], a descendant of the original system done at the University of
British Columbia, also ran on a master-slave multiprocessor configuration. The V-System interprocess
communication has been modeled after that of Thoth, although a number of modifications were made
showing an evolution to a more sophisticated kernel interface.
2.5. Remote Memory References

We take Spector's thesis work as a representative sample of this category [71, 72]. Another example is the
work done by Leblanc [45, 46] in the context of the StarMod programming language [23], although his work
includes many other models of interprocess communication.

Spector's work again attempts to provide the same sort of integration as the V-System. Although similar in
its goals, Spector proposes a fundamentally different interconnection mechanism, known as the remote
memory reference model. All machines in the system share a common address space and any machine can
transparently make memory accesses to any location in that address space, regardless of which machine the
actual physical memory location resides on. Again, as in LOCUS and in V, layering of protocols is avoided for
performance reasons. Both Spector's and Leblanc's theses give extensive performance measurements of their
respective communication primitives.


2.6.  Virtual  Memory-Oriented  Systems 

The representative system in this class is the Apollo DOMAIN system [44]. The distinctive features of the
DOMAIN include: a network-wide file system of objects addressed by unique identifiers (UIDs); a single-level
store for transparently accessing all objects, regardless of their location in the network; and a network-wide
hierarchical name space. A unique aspect of the DOMAIN system is its network-wide single-level store.
(Multics [26] is an example of a single-level store for a centralized system.) Programs access objects by
presenting their UIDs and asking them to be mapped into their address space. Subsequently, they are
accessed with ordinary machine instructions, utilizing virtual memory demand paging. The purpose of the
single-level store is not, unlike the systems referred to in the previous section, to create network-wide shared
memory semantics akin to that provided by a closely coupled multiprocessor. Instead, it is a form of lazy
evaluation: only required portions of objects are retrieved from disk or over the network. Another purpose is
to provide a uniform, network transparent way to access all objects: the mapping operation is independent of
whether the UID is for a local or remote object. We note that, as an alternative to its single-level store, the
DOMAIN system also supports network transparent interprocess communication through so called sockets.
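
The lazy-evaluation idea behind a single-level store, as summarized above, can be illustrated with a small sketch. It is not based on the actual DOMAIN implementation: `ObjectStore`, `MappedObject`, `map_object` and the 4-byte `PAGE_SIZE` are hypothetical names and values chosen for the example, and a dictionary lookup stands in for demand paging hardware.

```python
PAGE_SIZE = 4  # assumed page size, kept tiny for the example


class ObjectStore:
    """Stand-in for a node holding objects, whether local disk or remote."""

    def __init__(self, objects):
        self.objects = objects  # maps UID -> bytes
        self.fetches = []       # log of (uid, page) retrievals

    def fetch_page(self, uid, page):
        self.fetches.append((uid, page))
        data = self.objects[uid]
        return data[page * PAGE_SIZE:(page + 1) * PAGE_SIZE]


class MappedObject:
    """An object mapped into the caller's address space; pages load on demand."""

    def __init__(self, store, uid):
        self._store = store
        self._uid = uid
        self._pages = {}

    def __getitem__(self, address):
        page, offset = divmod(address, PAGE_SIZE)
        if page not in self._pages:  # simulated page fault
            self._pages[page] = self._store.fetch_page(self._uid, page)
        return self._pages[page][offset]


def map_object(stores, uid):
    """Map an object by UID; the caller does not say where the object lives."""
    for store in stores:
        if uid in store.objects:
            return MappedObject(store, uid)
    raise KeyError(uid)
```

Reading one byte retrieves only the page that contains it, and the same `map_object` call is used whichever store holds the UID, which mirrors the two purposes named in the text.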


2.7.  Remote  Procedure  Calls 

The idea of remote procedure calls is quite appealing: procedure calls are the predominant mechanism for
transfer of control between programs or program modules on a single machine. It is simply proposed that
their semantics be extended across machine boundaries. The first mention of remote procedure calls dates
back to 1976 and was made in the context of the Arpanet network [77]. Their applicability to local network
environments was first explored in Nelson's thesis [57]. Nelson suggested various semantic definitions and
explored the performance tradeoffs of different implementation strategies. The first full-fledged
implementation was carried out by Birrell and Nelson in the framework of the Cedar programming
environment at Xerox PARC [9]. Their implementation runs on Dorados [35] interconnected by a 3 Mb
Ethernet [54]. The Cedar RPC implementation again depends heavily on specialized protocols for adequate
performance. Wherever possible, we compare its performance with the measurements obtained for the
V-System.

2.8.  Object-Based  Systems 

The object model has long been a popular program structuring tool. It has its roots as far back as the
Simula language [24]. Among the object-based distributed systems of interest are Eden [3, 41], Argus [47, 48]
and Clouds [2]. (While these are "pure" object systems, virtually all server-based systems can be considered
variants of the object model.)

The Eden project addresses the construction of an environment similar to that of the V-System. Its
emphasis is, however, much more on programmability than on performance. The Eden world consists of a
virtual space of typed objects, possibly residing on different machines and communicating via a fairly
heavy-handed invocation mechanism. Within the Eden objects, multiple concurrent processes can coexist. The
system is currently implemented at guest level on top of Unix 4.1BSD. Argus and Clouds are similar to Eden in
philosophy. Unlike Eden though, they both support atomic transactions on objects at the kernel level.


2.9.  The  Cambridge  Distributed  System 

The Cambridge distributed system model [56, 78] also takes the view of a local network as an extended
backplane between machines. The local network in their case is the Cambridge ring [79]. There are dedicated
servers on the ring, but there is no explicit idea of personal workstations, at least in the traditional sense.
Instead there is a processor bank, consisting of a number of processors that are in principle available to any
user. When a user logs in to the network (through a terminal concentrator connected to the net), he is
allocated a processor out of the processor bank. If needed and if extra processors are available, more
processors can be allocated to a single user.

2.10.  Chapter  Summary 

Except for the Arpanet, all systems surveyed in this chapter share the goal of the V-System to create a single
logical entity out of several loosely connected machines. A number of systems have specifically addressed the
problem of high-performance communication, in particular the LOCUS system and Cedar RPC. LOCUS
achieves high performance by the use of protocols specifically tuned to certain applications, in particular file
access. In contrast, the V-System provides a single set of interprocess communication primitives and a single
supporting protocol, which is then used by applications as a base layer for structuring their
application-specific communication needs. The question naturally arises whether this interposition of an extra
layer does not compromise the performance of certain key applications such as file access. This question is
addressed in the next chapter, where we present the V interprocess communication primitives and their
performance. The RPC model presents a similar approach to that of the V-System in presenting a single set of
primitives to applications rather than application-dependent primitives and protocols. However, current work,
in particular Nelson's thesis [57], has been more concerned with the effects on performance of implementing the
RPC protocol at various layers of a protocol hierarchy rather than with the performance of applications
implemented on top of RPC.

—  3  —

The  V  Kernel  Primitives  and  their  Performance 


3.1.  Introduction

So far, we have described in general terms the research area and the V-System, which functions as the
concrete research setting. The goal of the V-System is the construction of an integrated distributed system,
whereby host boundaries are hidden from the user. We have indicated how the design of an efficient
transparent interprocess communication mechanism -- the topic of this thesis -- is an important step towards
achieving that goal. In this chapter we define the interprocess communication facilities available at the V
kernel interface and we present the results of an empirical performance evaluation of the kernel. (A
preliminary version of this chapter appeared in [17].) The figures in this chapter are in a sense raw numbers;
they are obtained by conducting experiments in a given hardware environment and in an otherwise idle
system. In the next chapter we use these raw numbers as inputs to the construction of a queueing network
model of network page-level file access. That way we can assess the effects of changing hardware parameters
or increasing load on the performance measurements presented in this chapter.

This chapter is organized as follows: Section 3.2 contains the definition of the V kernel interprocess
communication primitives. Section 3.3 shows the application of these primitives in a simple example. In
Section 3.4, we describe our measurement methods and we characterize the hardware environment in which
the experiments took place. Next, in Section 3.5, we give performance measurements of individual kernel
primitives. Finally, the bulk of this chapter is taken up by Sections 3.6 and 3.7 in which we discuss
performance figures for two important applications built on top of the V kernel, namely file access and pipe
data transfer.


3.2.  The  V  Kernel  Interprocess  Communication  Primitives 

In V, processes are identified by a 32-bit unique identifier, called the process identifier. They communicate
via the synchronous exchange of messages. Every request has a reply associated with it and the sender of a
message blocks until the message is received and replied to. Messages are normally small and fixed size (8
32-bit words) although under some circumstances a segment of data can be appended to a message. This
segment is variable in size up to a fixed maximum. There is a separate data transfer facility for moving larger
amounts of data. Additionally, there are primitives for low-level name registration and lookup, and for
querying the existence of a process. The complete set of interprocess communication facilities follows.³

Send ( message, processId )

Send the message in message to the process with process identifier processId, blocking the
active process until the message is both received and replied to.

The kernel legislates a message format, by which a process specifies in the message the
segment of its address space that the message recipient may access and whether the
recipient may read or write that segment. A segment is specified by the last two words of a
message, giving its start address and its length respectively. Reserved flag bits at the
beginning of the message indicate whether a segment is specified and if so, its access
permissions.


³The definitions of the primitives in this chapter are taken from [7] with some omissions for the sake of brevity.

It is intended and assumed that most logical requests can be assigned a request code that is
stored in the first word of the request message so that the bits are set correctly for the
request by the value of the request code.
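
The message layout described under Send can be sketched in C as follows. This is only an illustration under stated assumptions: the text fixes the message at 8 32-bit words, places the segment's start address and length in the last two words, and reserves flag bits at the beginning of the message, but it does not give the bit positions. The names VMessage, SEGMENT_PRESENT, SEGMENT_READ, SEGMENT_WRITE and both helper functions are hypothetical and are not taken from the V kernel.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MSG_WORDS 8                 /* fixed message size: 8 32-bit words */

/* Hypothetical flag bit positions in word 0 (not specified in the text). */
#define SEGMENT_PRESENT 0x80000000u
#define SEGMENT_READ    0x40000000u
#define SEGMENT_WRITE   0x20000000u

typedef struct {
    uint32_t word[MSG_WORDS];
} VMessage;

/* Fill in a request: the request code (which carries the flag bits) goes
 * in the first word, the segment start address and length in the last two. */
static void format_request(VMessage *m, uint32_t requestCode,
                           uint32_t segStart, uint32_t segLength)
{
    assert(m != NULL);
    memset(m, 0, sizeof *m);
    m->word[0] = requestCode;
    m->word[MSG_WORDS - 2] = segStart;
    m->word[MSG_WORDS - 1] = segLength;
}

/* Nonzero if the message names a segment and grants the given access. */
static int segment_access_granted(const VMessage *m, uint32_t access)
{
    assert(m != NULL);
    return (m->word[0] & SEGMENT_PRESENT) != 0 && (m->word[0] & access) != 0;
}
```

Folding the flag bits into the request code, as the text suggests, means a client never sets permissions separately: choosing the request code is enough to grant the server the access that request needs.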

( processId, byteCount ) = Receive ( message, segmentPointer, segmentSize )

Suspend the active process until a message is available from a sending process. Return the
processId of that process, leave the message in the array pointed to by message and at most
the first segmentSize bytes of the segment included with the message in the receiver's
address space starting at segmentPointer. The actual number of bytes received in the
segment is returned in byteCount. Messages are queued in first-come, first-served order.

processId2 = ReceiveSpecific ( message, processId1 )

Suspend the active process until a message is available from the process specified by
processId1, returning the process identifier of this process in processId2 and the message in
the array pointed to by message. If processId1 is not a valid process identifier,
ReceiveSpecific returns 0. ReceiveSpecific is almost exclusively used in the latter fashion,
to query the existence of process processId1.

Reply ( message, processId, destinationPointer, segmentPointer, segmentSize )

Send the reply message specified by message and the segment beginning at segmentPointer
of length segmentSize to the process specified by processId. The specified process must be
awaiting reply from the active process, and if the segment size is different from zero, the
appropriate access rights must have been given as described under Send. The reply
message overwrites the original message in the sender and the segment, if present, is placed
at destinationPointer in the sender's address space.

Forward ( message, processId1, processId2 )

Forward the message pointed to by message to the process specified by processId2 as
though it had been sent by the process processId1. processId1 must be awaiting reply from
the active process. The effect of this operation is the same as processId1 sending directly to
processId2 except for the active process being noted as the forwarder of the message. Note
that Forward does not block.

MoveFrom ( sourcePid, destinationPointer, sourcePointer, byteCount )

Copy byteCount bytes from the segment starting at sourcePointer in the address space of
sourcePid to the segment starting at destinationPointer in the active process's address
space. The sourcePid process must be awaiting reply from the active process and must
have provided read access to the appropriate segment of its address space using the
message format conventions described under Send.

MoveTo ( destinationPid, destinationPointer, sourcePointer, byteCount )

Copy byteCount bytes from the segment starting at sourcePointer in the active process's
address space to the segment starting at destinationPointer in the address space of the
destinationPid process. The destinationPid process must be awaiting reply from the active
process and must have provided write access to the appropriate segment of its address
space using the message format conventions described under Send.

SetPid ( logicalId, processId, scope )

Associate processId with the specified logicalId within the specified scope. Subsequent
calls to GetPid with this logicalId and appropriate scope return this processId. The scope is
one of local, remote or both.

processId = GetPid ( logicalId, scope )

Return the processId of the process registered using SetPid with the specified logicalId and
scope, 0 if not set. The scope is one of local, remote or both.

3.3.  Example  of  Use

At this point an example of use is in order. Readers familiar with the concepts of the V-System can easily
skip this section and proceed immediately to Section 3.4. The example uses a loose variation of the C
programming language [34]. While the example is kept simple for the sake of brevity, we show how it can be
extended to more realistic applications.

3.3.1.  Specification  and  Implementation 

We wish to implement a server process that responds to two client requests, GetData and PutData. The
server has a single data buffer, from which the data is to come on a GetData, and into which the data is to be
placed on a PutData. Clients are expected to specify in their requests the size of the data transfer they expect
to happen. Thus, on a PutData with a certain size, the buffer is filled up to that size with the corresponding
data. The contents of the rest of the buffer are undefined. On a subsequent GetData, the contents of the
buffer are returned up to the size requested in the GetData, or up to the size of the preceding PutData,
whichever is smaller.

Figure 3-1 lists some type definitions that form the interface between client and server. Figures 3-2 and
3-3 show the server's main routine and some auxiliary routines, respectively. A client executing a PutData
followed by a GetData is shown in Figure 3-4. A discussion follows in Section 3.3.2.


/* Client - Server Interface */

RequestCode = enum ( getData, putData );

ReplyCode = enum ( requestTooBig, notImplemented, OK );

/* Format of the messages between client and server */

Request = record
{
    RequestCode requestcode;
    char        *bufferPtr;
    unsigned    bufferSize;
}

Reply = record
{
    ReplyCode   replycode;
    unsigned    bufferSize;
}

Figure 3-1: Client - Server Interface

3.3.2.  Discussion

3.3.2.1  Server  Implementation

Figure 3-2 shows a typical construction for a simple server in the V-System. First, the server initializes its
data structures (in this simple case this is only the variable currentBufferSize), then announces itself by
executing a SetPid. Then it goes into a loop waiting for incoming messages. Depending on the request code
in an incoming message, appropriate action is taken to satisfy this request. When the incoming request is a
GetData, the server uses a MoveTo to move the data from its buffer to the client's buffer. For a PutData, a
MoveFrom accomplishes the data transfer in the other direction. Finally, the server puts a reply code in the
reply message, and executes a Reply.

3.3.2.2  Client  Implementation

The client first finds out the process identifier of the server by executing a GetPid (see Figure 3-4). It then
formats the appropriate messages and sends them off to the server. The exact value of the GetData and
PutData request codes is not shown here. Appropriate bits in both request codes have to be set such that the
client provides access to the segment of its address space containing the data buffer, read access in the case of
PutData and write access in the case of GetData. This is necessary to allow the server's MoveTo and
MoveFrom operations to succeed.

In practice, clients would seldom use the message passing primitives explicitly, as suggested in this example.
Typically, with each server there exists a set of library routines, which provide a procedural interface to the
server's message interface. Thus, clients would call such a library routine, which then takes care of formatting
the message and sending it to the server.


3.3.2.3  More  Sophisticated  Server  Implementation

A more realistic implementation of a server process, for instance a file server, would have to take into
account a number of additional considerations. First, data is not always available immediately on a GetData
request. In many cases, it has to be fetched from the disk first. Second, it is unlikely that the server has
sufficient buffer space to read the whole file into its buffer and then transfer it to the client via a MoveTo. A
more realistic server strategy would be to allocate a fixed-size buffer in which it could fetch a block from the
disk, then transfer that block to the client, fetch another block in the same buffer, and so forth. In that case,
the GetData routine would look as in Figure 3-5.

Finally, it is often the case that a small amount of data is to be transferred between client and server, for
instance a single file page. Such applications would take advantage of the capability to receive such a small
segment together with a message, or to add a small segment to a Reply message. For instance, the GetData
routine would look as in Figure 3-6.

This completes the description of the V interprocess communication primitives. For more detail, the reader
is referred to [7]. We now turn to the performance evaluation of the kernel. First, we describe in detail the
experimental environment in which the measurements were conducted and the measurement methods used.

3.4.  The  Experimental  Environment 

3.4.1.  Measurement  Methods 

Measurements of individual operations are performed by executing the operation N times (typically 1000
times), recording the total time required, subtracting loop overhead and other artifacts, and then dividing the
total time by N. Measurement of total time relies on the software-maintained V kernel time, which is accurate
to plus or minus 10 milliseconds.

DataServer()
{
    char        Buffer[maxBufferSize];      /* Data buffer */
    ProcessId   pid;                        /* Requestor */
    Message     msg;                        /* Message buffer */
    Request     *req = (Request *) msg;     /* Buffer for request */
    Reply       *rep = (Reply *) msg;       /* Buffer for replies */
    ReplyCode   reply;                      /* Reply code */
    unsigned    currentBufferSize;          /* Current buffersize */

    /* Initialize */
    currentBufferSize = 0;

    /* Register as the data server */
    SetPid( dataServer, GetPid( activeProcess, local ), any );

    /* Wait for requests to come in */
    while( true )
    {
        ( pid, size ) = Receive( msg, NULL, 0 );
        switch( req->requestcode )
        {
            case putData:
            {
                reply = PutData( pid, req );
                break;
            }
            case getData:
            {
                reply = GetData( pid, req );
                break;
            }
            default:
            {
                reply = notImplemented;
                break;
            }
        }
        rep->replycode = reply;
        Reply( msg, pid, NULL, NULL, 0 );
    }
}

Figure 3-2: Server: Main Routine

ReplyCode PutData( pid, req ) ProcessId pid; Request *req;
{
    extern char Buffer[];
    extern unsigned currentBufferSize;

    if( req->bufferSize > maxBufferSize )
        return( requestTooBig );

    MoveFrom( pid, Buffer, req->bufferPtr, req->bufferSize );
    currentBufferSize = req->bufferSize;
    return( OK );
}

ReplyCode GetData( pid, req ) ProcessId pid; Request *req;
{
    extern char Buffer[];
    extern unsigned currentBufferSize;

    if( req->bufferSize > currentBufferSize )
        req->bufferSize = currentBufferSize;

    MoveTo( pid, req->bufferPtr, Buffer, req->bufferSize );
    return( OK );
}

Figure 3-3: Server: Auxiliary Routines

Client()
{
    ProcessId   serverPid;
    Message     msg;
    Request     *req = (Request *) msg;
    unsigned    myBufferSize;
    char        *myBufferPtr;

    /* Allocate a buffer */
    myBufferPtr = malloc( 1, myBufferSize );

    /* Locate the server */
    serverPid = GetPid( dataServer, any );

    /* Do a PutData */
    req->requestcode = putData;
    req->bufferPtr = myBufferPtr;
    req->bufferSize = myBufferSize;
    Send( req, serverPid );

    /* Do a GetData */
    req->requestcode = getData;
    req->bufferPtr = myBufferPtr;
    req->bufferSize = myBufferSize;
    Send( req, serverPid );
}

Figure 3-4: Client

ReplyCode GetData( pid, req ) ProcessId pid; Request *req;
{
    extern char Buffer[];
    unsigned i, numBlocks;

    /* Assume requests are multiples of block size */
    numBlocks = req->bufferSize / blockSize;

    /* Transfer data, block by block */
    for( i=0; i<numBlocks; i++ )
    {
        ReadDataOffDisk( Buffer, blockSize );
        MoveTo( pid, req->bufferPtr + i*blockSize,
                Buffer, blockSize );
    }
    return( OK );
}

Figure 3-5: Using Multiple MoveTo operations

ReplyCode GetData( pid, req ) ProcessId pid; Request *req;
{
    extern char Buffer[];

    ReadDataOffDisk( Buffer, pageSize );
    Reply( req, pid, req->bufferPtr, Buffer, pageSize );
}

Figure 3-6: Appending Small Segments to Messages

Measurement of processor utilization is done using a low priority "busywork" process on each workstation
that repeatedly updates a counter in an infinite loop. All other processor utilization reduces the processor
allocation to this process. Thus, the processor time used per operation on a workstation is the total time
minus the processor time allocated to the "busywork" process divided by N, the number of operations
executed.

Using 1000 trials per operation and time accurate to plus or minus 10 milliseconds, our results should be
accurate to about 0.02 milliseconds except for the effect of variation in network load. These variations were
observed to be minimal under normal circumstances (i.e. when no packets are lost). Packet loss can cause a
single exchange to take substantially longer, but its low frequency makes the effect invisible in the calculation
of the mean values.
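
The arithmetic behind these measurements can be sketched in C as follows. The function and parameter names are illustrative and are not taken from the V kernel. The error bound assumes that the clock error of plus or minus 10 milliseconds applies independently to the start and the end reading of the measured interval, which is one way to arrive at the 0.02 millisecond figure; the text does not spell out how that figure was derived.

```c
#include <assert.h>

/* Mean time per operation in msec: total elapsed time minus loop
 * overhead and other artifacts, divided by the number of trials. */
static double per_operation_msec(double totalMsec, double overheadMsec, int n)
{
    assert(n > 0);
    return (totalMsec - overheadMsec) / n;
}

/* Processor time per operation in msec: total time minus the time the
 * low-priority "busywork" process was allocated, divided by n. */
static double processor_msec_per_operation(double totalMsec,
                                           double busyworkMsec, int n)
{
    assert(n > 0);
    return (totalMsec - busyworkMsec) / n;
}

/* Worst-case error per operation, assuming the clock error affects both
 * the start and the end reading (an assumption of this sketch). */
static double per_operation_error_msec(double clockErrorMsec, int n)
{
    assert(n > 0);
    return 2.0 * clockErrorMsec / n;
}
```

For example, per_operation_error_msec(10.0, 1000) evaluates to 0.02 under that assumption.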


3.4.2.  Hardware  Environment 

The hardware environment in which the experiments take place was described earlier (see Section
1.2.1). In summary, the system runs on 68000-based SUN workstations, running at different clock speeds (8 and
10 MHz), the latter being approximately a 1 Mips machine. Two networks are used, a 3 Mb and a 10 Mb
Ethernet. However, neither the processor speed nor the network data rate fully captures the potential of a
particular machine-network configuration with respect to network access. In the next section, we introduce
the notion of network penalty, which provides a better characterization of a hardware configuration in this
respect.

3.4.3.  Network  Penalty 

Our measurements of the V kernel are primarily concerned with two comparisons:

•  The cost of remote operations versus the cost of the corresponding local operations.

•  The cost of remote access using V kernel remote operations versus the cost for other means of remote
access.

An important factor in both comparisons is the cost imposed by network communication. In the second
comparison, the basic cost of moving data across the network is a lower bound on the cost for any network
access method. In the first comparison, the cost of a remote operation should ideally be the cost of the local
operation plus the cost of moving data across the network (data that is in shared kernel memory in the local
case) minus the amount of concurrency that can be achieved when the operation is executed on two different
machines. For instance, a local message Send passes pointers to shared memory buffers and descriptors in the
kernel while a remote message Send must move the same data across the network. On the other hand, some
work on the sending machine (like blocking the sending process and performing a process switch, if
necessary) can be done in parallel with the receiving machine copying the message packet out of the network
interface and activating the receiving process. This argument also assumes that the necessary provisions for
remote communication do not noticeably degrade the performance of local communication. Although not
shown here, this assumption is correct for the V kernel.

To quantify the cost of network communication, we define a measure we call the network penalty. The
network penalty is defined to be the time to transfer N bytes from one workstation to another in a network
datagram on an idle network and assuming no errors. The network penalty is a function of the processor, the
network, the network interface and the number of bytes transferred. It is the minimal time penalty for
interposing the network between two software modules that could otherwise transmit the data by passing
pointers. The network penalty is obtained by measuring the time to transmit N bytes from the main memory
of one workstation to the main memory of another and vice versa, and dividing the elapsed time for the
experiment by 2. The transfers are implemented at the data link layer and interrupt level so that no protocol
or process switching overhead appears in the results. The measurement results therefore provide an accurate
assessment of the potential of a particular processor-network configuration with respect to network
communication.

Let us first consider data transfers that fit within a single network packet⁴. Measurements of network
penalty were made using the 3 Mb Ethernet and the 10 Mb Ethernet. In all measurements, the network was
essentially idle due to the unsociable times at which measurements were made. The assumption of an idle
network is consistent with the utilization of most local networks. For instance, our network averages 1 to 2
percent utilization. Table 3-1 lists our measurements of the 3 Mb network penalty for the SUN workstation
using the 8 and 10 MHz processors, and Table 3-2 provides the results for a 10 Mb Ethernet, using the 3-Com
interface [1]. The network time column gives the time for the data to be transmitted based on the physical bit
rate of the medium, namely 3 or 10 Mb.


Single Packet Network Penalty -- 3 Mb

Bytes    Network Time    Network Penalty
                         8 MHz    10 MHz

  64         0.17         0.80     0.65
 128         0.35         1.20     0.96
 256         0.70         2.00     1.62
 512         1.39         3.65     3.00
1024         2.78         6.95     5.83

Table 3-1: 3 Mb Ethernet SUN Workstation Network Penalty (times in msec.)

The difference between the network time, computed at the network data rate, and the measured network
penalty time is accounted for primarily by the processor time to generate and transmit the packet and then
receive the packet at the other end. For instance, for a 1024 byte packet, using an 8 MHz processor and the 3
Mb network, the copy time from memory to the Ethernet interface and vice versa is roughly 1.90 milliseconds
in each direction. Thus, of the total 6.95 milliseconds network penalty on the 3 Mb network, 3.80 is copy
time, 2.78 is network transmission time and 0.3 is (presumably) network and interface latency. If we consider
a 10 Mb Ethernet, a similar argument holds, making the processor time approximately 75 percent of the
overall network penalty. The importance of the processor speed is also illustrated by the difference in
network penalty for the two processors measured in Tables 3-1 and 3-2.

⁴A number of practical considerations impose a maximum packet size of about 1024 bytes on our 3 Mb network. The maximum packet
size on the 10 Mb Ethernet is 1536 bytes.

Single Packet Network Penalty -- 10 Mb

Bytes    Network Time    Network Penalty
                         8 MHz    10 MHz

  64         0.05         0.73     0.55
 128         0.10         1.00     0.80
 256         0.20         1.63     1.29
 512         0.41         2.86     2.29
1024         0.82         5.13     4.26

Table 3-2: 10 Mb Ethernet SUN Workstation Network Penalty (times in msec.)
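
The network time column and the breakdown of the 1024 byte penalty can be checked with a few lines of C. This is a sketch with illustrative function names. The bit rate of 2.94 Mb/s used for the 3 Mb Ethernet is an assumption made here because it comes close to the tabulated network times; the text itself only says "3 Mb".

```c
#include <assert.h>

/* Time in msec to put `bytes` bytes on the wire at `megabitsPerSec`
 * (1 Mb/s is 1000 bits per msec). */
static double network_time_msec(int bytes, double megabitsPerSec)
{
    assert(bytes >= 0 && megabitsPerSec > 0.0);
    return (bytes * 8.0) / (megabitsPerSec * 1000.0);
}

/* Part of the measured penalty that is explained neither by the two
 * copies (one on each machine) nor by the wire time. */
static double residual_msec(double penaltyMsec, double copyEachWayMsec,
                            double wireMsec)
{
    return penaltyMsec - 2.0 * copyEachWayMsec - wireMsec;
}

/* Fraction of the measured penalty spent copying on both machines. */
static double copy_fraction(double penaltyMsec, double copyEachWayMsec)
{
    assert(penaltyMsec > 0.0);
    return 2.0 * copyEachWayMsec / penaltyMsec;
}
```

With the figures for 1024 bytes at 8 MHz on the 3 Mb network (6.95 msec penalty, 1.90 msec copy time in each direction, 2.78 msec network time), residual_msec gives about 0.37 msec, in line with the roughly 0.3 msec the text attributes to network and interface latency, and copy_fraction gives about 0.55.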

Let us now consider the case where the data transfer requires multiple packets to be sent from the sender to
the receiver. The naive approach for obtaining the network penalty for such a multi-packet data transfer
would be to sum the network penalties for the individual packets. If one wished to experimentally measure
this quantity, for instance for a 4-packet transfer, an experiment would be set up as depicted on the left hand
side of Figure 3-7 (labeled "non-streamed transfer"). A packet is sent from the sender to the receiver; a
packet is returned from the receiver to the sender; and this procedure is repeated 4 times. The elapsed time is
measured and divided by two in order to obtain the network penalty.


[Figure: two panels, "non-streamed transfer" (left) and "streamed transfer" (right), with a time axis]

Figure 3-7: Packet frame for Network Penalty Measurements
for Non-Streamed and Streamed Multi-Packet Transfers

This is however not the implementation of choice for multi-packet transfers on a local-area network. Due
to the low latency of such a network, multi-packet transfers are implemented more efficiently in "streamed
mode", as suggested by the right hand side of Figure 3-7. All necessary packets are transferred from the
sender to the receiver and then an equal number of packets is returned from the receiver to the sender. When
the elapsed time for this operation is measured and divided by two, a significantly lower value for the network
penalty is obtained. The reason for this difference is explained in Figure 3-8. The top line in this figure
corresponds to the transfer in non-streamed mode, while the bottom line corresponds to the streamed
transfer. The time axis runs horizontally from left to right and the example is for the case of the transfer
requiring two packets in each direction.
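
A rough timing model makes the difference between the two modes concrete. It is an idealization introduced here, not a result from the measurements. It assumes a fixed copy time c per packet on each machine, a fixed wire time w, negligible propagation delay, an interface that holds one packet at a time (so copying and transmission on the sender do not overlap), and, in streamed mode, a receiver copy of one packet that fully overlaps the sender's copy of the next. All names are illustrative.

```c
#include <assert.h>

/* Non-streamed: every packet pays copy-in, wire time and copy-out in
 * sequence, so the penalty is the sum of the single-packet penalties. */
static double nonstreamed_penalty_msec(int packets, double c, double w)
{
    assert(packets > 0);
    return packets * (2.0 * c + w);
}

/* Streamed (idealized): the receiver's copy of packet k overlaps the
 * sender's copy of packet k+1, so only the last receive copy is exposed. */
static double streamed_penalty_msec(int packets, double c, double w)
{
    assert(packets > 0);
    return packets * (c + w) + c;
}
```

For a single packet both functions return 2c + w. With c = 1.90 msec and w = 2.78 msec, the model gives 13.16 msec for two non-streamed packets against 11.26 msec for two streamed packets; these are outputs of the model, not measured values.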


Figure  3-8:  Advantages  of  Streamed  Multi-Packet  Transfers 

Consider first the chronological sequence of events in the case of non-streamed transfer. The sending processor copies a packet from main memory to its interface and then the interface puts the packet on the network. After a time period equal to the network propagation delay the packet arrives at the receiver's interface and then it is copied from the receiver's interface into the receiver's memory by the receiving processor. This process repeats itself in the reverse direction for the packet that is returned from the receiver to the sender, and then again for the next packet, and so forth. Note that the two processors are never active in parallel. This is not the case when the transfer is done in streamed mode, as shown on the second line of Figure 3-8. Due to the very low propagation delay of a local network, the packet is received in the receiver's interface almost completely concurrently with the sender's interface transferring it over the network*. Therefore the processor on the sending machine can start copying the next packet from memory to the interface in parallel with the processor on the receiving machine copying the previous packet out of its interface into its memory. Because these copies happen in parallel and, as we saw before, the copy times contribute significantly towards the overall elapsed time, streamed transfer results in values of the network penalty that are substantially lower than those obtained for non-streamed transfers. Measurement results for the network penalty of multi-packet transfers in streamed mode are reported in Tables 3-3 and 3-4. The network penalty for non-streamed transfer can be computed by simply multiplying the figures in Tables 3-1 and 3-2 by the appropriate factors.

*In fact, the propagation delay is far exaggerated in Figure 3-8 to make it visible at all: typical propagation delays on a local network are on the order of 10 microseconds while the copy and transmission times depicted in Figure 3-8 are on the order of 1 millisecond.
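The timing argument for the two transfer modes can be put into a small model. The sketch below is only an illustration and not the procedure used to obtain the measurements: it assumes a single-buffered interface, a negligible propagation delay and equal copy times on both machines. The function names and the sample copy and transmission times are hypothetical.

```python
def non_streamed_penalty(n_packets, copy_ms, transmit_ms):
    """One-way time when every packet is copied in, transmitted and
    copied out before the next step starts (no processor overlap)."""
    return n_packets * (copy_ms + transmit_ms + copy_ms)


def streamed_penalty(n_packets, copy_ms, transmit_ms):
    """One-way time when the receiver copies packet i out of its
    interface while the sender copies packet i+1 into its own.
    With a single-buffered interface the sender still spends
    copy + transmit per packet; only the last receiver copy is
    not overlapped."""
    return n_packets * (copy_ms + transmit_ms) + copy_ms


if __name__ == "__main__":
    copy_ms, transmit_ms = 1.5, 0.8  # hypothetical values, in msec
    for n in (1, 2, 4):
        print(n,
              non_streamed_penalty(n, copy_ms, transmit_ms),
              streamed_penalty(n, copy_ms, transmit_ms))
```

For a single packet the two modes coincide; the gap grows by one copy time for every additional packet, which is the saving the text attributes to the parallel copies.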

Multi-Packet Network Penalty

Bytes    Network Time    Network Penalty
                         8 MHz     10 MHz

2048      5.57           11.64      9.96
4096     11.14           21.75     18.95

Table 3-3: 3 Mb Ethernet Sun Workstation Network Penalty (times in msec.)


Bytes    Network Time    Network Penalty
                         8 MHz     10 MHz

2048      1.64            7.72      6.04
4096      3.28           13.90     11.11

Table 3-4: 10 Mb Ethernet Sun Workstation Network Penalty (times in msec.)

Using the 10 MHz processor and 4 kilobytes per data transfer, these figures indicate effective data rates of 1.7 Mb and 3.4 Mb respectively, far below the advertised network data rates of 3 Mb and 10 Mb. Even if we were to let the size of the data transfer become arbitrarily large, the effective data rate would still be limited by the inverse of the sum of the packet copy time plus the packet transmission time, and therefore be noticeably inferior to the network data rate. Higher throughput values can be achieved if the interface provides double buffering. In that case, if the processor can fill the buffer at least as fast as the interface can write it to the network, the interface will be permanently busy, achieving, at least in theory, a throughput comparable to the data rate of the network. In practice, the throughput would probably be somewhat lower due to device latency and network contention.
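The two throughput limits described here can be written down directly. The following sketch is a simplified model under stated assumptions: it ignores device latency and contention, and the function names and sample packet parameters are hypothetical rather than taken from the measurements.

```python
def single_buffer_rate_mbps(packet_bits, copy_ms, transmit_ms):
    """Asymptotic rate (Mb/s) when copying and transmitting alternate,
    i.e. the inverse of copy time plus transmission time per packet."""
    return packet_bits / (copy_ms + transmit_ms) / 1000.0


def double_buffer_rate_mbps(packet_bits, copy_ms, transmit_ms):
    """Asymptotic rate (Mb/s) when one buffer is filled while the other
    is being transmitted; the slower of the two steps is the limit."""
    return packet_bits / max(copy_ms, transmit_ms) / 1000.0


if __name__ == "__main__":
    bits = 1024 * 8        # one 1024-byte packet
    transmit_ms = 0.8192   # time to send it at 10 Mb/s
    for copy_ms in (1.5, 0.5):  # hypothetical copy times
        print(copy_ms,
              single_buffer_rate_mbps(bits, copy_ms, transmit_ms),
              double_buffer_rate_mbps(bits, copy_ms, transmit_ms))
```

With double buffering the model reaches the network data rate only when the copy time does not exceed the transmission time, which is the condition stated in the text.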


3.5.  Interprocess  Communication  Performance 

Our  first  set  of  kernel  measurements  focuses  on  the  speed  of  local  and  network  interprocess 
communication.  The  kernel  performance  is  presented  in  terms  of  the  times  for  message  exchanges  and  the 
data  transfer  operations. 

3.5.1. Kernel Measurements

Table 3-5 gives the results of our measurements of message exchanges and data transfer with the kernel running on workstations using an 8 MHz processor and connected by a 3 Mb Ethernet. The columns labeled Local and Remote give the elapsed times for these operations executed locally and remotely. The Difference column lists the time difference between the local and remote operations. The Penalty column gives the network penalty for the amount of data transmitted as part of the remote operation. The Client and Server columns list the processor time used for the operations on the two machines involved in the remote execution of the operation. Table 3-6 gives the same measurements using a 10 MHz processor and Tables 3-7 and 3-8 refer to measurements with the 10 Mb network and the same processors. The times for both processors are given to indicate the effect the processor speed has on local and remote operation performance. As expected, the times for local operations, being dependent only on the processor speed, are 25 percent faster on the 25 percent faster processor. The almost 15 percent improvement for remote operations indicates that the processor speed is a significant performance factor for remote communication and is not rendered insignificant by the network delay.

Kernel Performance

Kernel Operation       Elapsed Time                  Network    Processor Time
                       Local   Remote  Difference    Penalty    Client   Server

Send-Receive-Reply     1.13    3.18    2.05
MoveFrom: 1 Kbyte      1.35    9.03    7.68          8.15       3.76
MoveTo: 1 Kbyte        1.34    9.05    7.71          8.15       3.59

Table 3-5: Kernel; 3 Mb Ethernet and 8 MHz Processor (times in msec.)


Send-Receive-Reply     0.94    2.54    1.60          1.30       1.44
MoveFrom: 1 Kbyte      1.19    8.00    6.81          6.77       3.32
MoveTo: 1 Kbyte        1.19    8.00    6.81          6.77       3.17

Table 3-6: Kernel; 3 Mb Ethernet and 10 MHz Processor (times in msec.)


Send-Receive-Reply     1.13    2.68    1.55          1.46       1.59
MoveFrom: 1 Kbyte      1.35    6.52    5.17          6.27       3.10
MoveTo: 1 Kbyte        1.34    6.51    5.17          6.27       3.19

Table 3-7: Kernel; 10 Mb Ethernet and 8 MHz Processor (times in msec.)


Send-Receive-Reply     0.94    2.23    1.29          1.10       1.30
MoveFrom: 1 Kbyte      1.19    5.83    4.64          5.16       2.38
MoveTo: 1 Kbyte        1.19    5.86    4.67          5.16       2.49

Table 3-8: Kernel; 10 Mb Ethernet and 10 MHz Processor (times in msec.)

3.5.2. Interpreting the Measurements

A number of interesting observations can be made based on these figures. First, it can be seen that a significant level of concurrent execution takes place between workstations even though the message-passing is fully synchronized. For instance, transmitting the packet, blocking the sender and switching to another process on the sending workstation proceeds in parallel with the reception of the packet and the readying of the receiving process on the receiving workstation. Concurrent execution is indicated by the fact that the total of the server and client processor times is greater than the elapsed time for a remote message exchange (see the Client and Server columns in the above tables). In particular, for the 10 Mb network and the 10 MHz processor, the elapsed time of 2.23 milliseconds has to be compared to the sum of the client and server processor utilizations which amounts to 3.04 milliseconds, indicating that both processors are concurrently active during 36 percent of the elapsed time.
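The 36 percent figure follows from the two numbers quoted in the paragraph. A minimal sketch of the arithmetic, assuming that at least one of the two processors is busy at every instant of the exchange (otherwise the value is only a lower bound on the overlap); the function name is hypothetical:

```python
def concurrency_fraction(processor_total_ms, elapsed_ms):
    """Fraction of the elapsed time during which both processors are
    busy, assuming some processor is busy at every instant."""
    overlap_ms = processor_total_ms - elapsed_ms
    return overlap_ms / elapsed_ms


if __name__ == "__main__":
    # 10 Mb network, 10 MHz processor: 3.04 msec of total processor
    # time against 2.23 msec of elapsed time.
    print(round(concurrency_fraction(3.04, 2.23), 2))
```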

Second, we have argued before that a good lower bound on the remote message time is the sum of the local message time and the network penalty minus the concurrency observed between the two processors. In fact, the remote message time is about 1 millisecond higher than this lower bound (2.2 milliseconds vs. a lower bound of 1.2 milliseconds). This discrepancy is not unexpected and is primarily accounted for by the need to dynamically allocate a buffer for the incoming message on the receiving machine. Specifically, for local message passing, no dynamic buffer allocation is necessary because the message can be buffered in the process descriptor of the sender. When a message comes in over the network, a buffer has to be allocated dynamically. Similarly, when the Reply is sent, that buffer has to be deallocated. A newer version of the kernel has at all times such a buffer preallocated in order to reduce this cost. Additional costs that are incurred on top of those accounted for in the calculation of the lower bound are the need for locating the host of the remote process and the fact that in the kernel the Ethernet device is shared between the kernel and application programs.

As a final observation, we claim that the absolute difference between local and remote message times is sufficiently small for it to be possible to intermix local and remote message exchanges rather freely. Some care is required in interpreting this statement. Superficially, the fact that the remote Send-Receive-Reply sequence (for the 10 Mb network) takes twice as long as for the local case would seem to suggest that distributed applications should be designed to minimize inter-machine communication. In general, one might consider it impractical to view interprocess communication as transparent across machines when the speed ratio is that large. However, a more realistic interpretation is to recognize that the remote operation adds a delay of about one millisecond, and that in many cases this time is insignificant relative to the time necessary to process a request in the server. Furthermore, the sending or client workstation processor is busy with the remote Send for only 1.30 milliseconds out of the total 2.23 millisecond time (using the 10 MHz processor). Thus, one can offload the processor on one machine by, for instance, moving a server process to another machine if its request processing generally requires more than 0.36 milliseconds of processor time, i.e. the difference between the local Send-Receive-Reply time and the local processor time for the remote operation.

The above observations relate primarily to the Send-Receive-Reply sequence. A number of similar arguments can be made for the MoveFrom and MoveTo primitives, although some additional considerations have to be taken into account. First, in order to establish a reasonable lower bound on the cost of a remote MoveFrom, it is not quite accurate to take the sum of the local time plus the network penalty minus the amount of concurrent execution. The cost of a local copy has to be subtracted from this quantity in order to establish an accurate lower bound. Indeed, unlike for a Send operation, where a local copy of the message is made*, no such copy is done for MoveTo and MoveFrom operations. On a 10 Mb network and a 10 MHz processor, this argument leads to a lower bound of 5.06 milliseconds. (The copy time is approximately 0.70 milliseconds.) This figure has to be compared with the experimentally measured time of 5.83 milliseconds. Relatively speaking, the experimental numbers for the MoveFrom operation are much closer to the lower bound than the numbers obtained for the Send-Receive-Reply sequence. This is explained by the fact that the MoveFrom operation does not require any dynamic buffering, since by its definition, buffers are available both in the process that is executing the MoveFrom as well as in the target process. Additionally, it can be observed that, relatively speaking, the amount of concurrent execution for MoveFrom operations is much lower than for the Send-Receive-Reply sequence (10 percent vs. 36 percent). This again indicates the dominant cost of the actual data transfer in the overall cost of the MoveFrom operation, a cost which cannot be decreased by concurrency, since its basic constituents, the sender copy into the network interface, the transmission and the receiver copy out of the network interface, necessarily happen in series.

*The cost of this copy is small anyway, given the small size of the message.
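The two lower bounds used in this section can be reproduced with a few lines of arithmetic. The sketch below rests on assumptions that are not spelled out in the text: it reads 1.10 and 5.16 as the network penalty entries of Table 3-8, and it takes the concurrent execution to be 36 percent of the 2.23 msec exchange and 10 percent of the 5.83 msec MoveFrom. The function names are hypothetical.

```python
def send_lower_bound(local_ms, penalty_ms, overlap_ms):
    """Local time plus network penalty minus concurrent execution."""
    return local_ms + penalty_ms - overlap_ms


def move_lower_bound(local_ms, penalty_ms, overlap_ms, local_copy_ms):
    """As above, but the local copy that a remote MoveFrom or MoveTo
    does not perform is subtracted as well."""
    return local_ms + penalty_ms - overlap_ms - local_copy_ms


if __name__ == "__main__":
    # 10 Mb network, 10 MHz processor (assumed reading of Table 3-8).
    send = send_lower_bound(0.94, 1.10, 3.04 - 2.23)
    move = move_lower_bound(1.19, 5.16, 0.10 * 5.83, 0.70)
    print(round(send, 2), round(move, 2))
```

Under these assumptions the results land within rounding of the 1.2 and 5.06 millisecond bounds quoted in the text.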


3.5.3. Comparison with Other Results

Comparison with other experimental results is always extremely difficult, given the subtle variations in the semantics of the primitives, and differences in hardware and measurement methods. The figures in this section are thus to be taken more as ballpark figures rather than as exact comparisons.

A V message exchange is very similar to Leblanc's synchronous port call [46]. Leblanc estimates that on a 1 MIPS processor and a 10 Mb network (similar to our prototype environment), a synchronous port call with 2 bytes of data would take 3.43 milliseconds. It is probably fair to assume that this cost includes a sizable constant term, independent of the amount of data in the call, and a second term, linear in the number of data bytes. Thus, one might expect synchronous port calls with 32 bytes, equal to the amount of data in V messages, to take slightly more than 3.43 milliseconds. This result is to be compared with 2.23 milliseconds for a V message exchange.

In his thesis, Spector reports that a remote memory reference of 16 bits between two 68000s connected by a 3 Mb network takes 152 microseconds, under favorable circumstances [71]. The amount of data transferred in a V message exchange would require 32 such remote references (32 bytes for both the Send and the Reply), resulting in an overall cost of approximately 4.8 milliseconds.

Finally, a Cedar remote procedure call with 8 arguments and 8 results is reported to take 1.25 milliseconds between two Dorados on a 3 Mb Ethernet network [9]. About 0.25 milliseconds of that time is accounted for by network transmission, and thus approximately 1 millisecond is spent on the processors. Since a Dorado is an order of magnitude faster than a 68000, we might estimate approximately 10 milliseconds for a comparable operation on a 68000. Of course, an 8 argument, 8 result Mesa procedure call entails substantially more functionality than a V message exchange.

3.5.4. Multi-Process Traffic

The discussion so far has focused on a single pair of processes communicating over the network. In reality, processes on several workstations would be using the network concurrently to communicate with other processes. Also, server processes would be accessed concurrently by many other processes. Some investigation is called for to determine how much message traffic the network can support, and what degradation in response time is to be expected as a result of other network or server load.

First, let us consider the effects of network load. A pair of workstations communicating via Send-Receive-Reply at maximum speed generates a load on the network of about 400,000 bits per second, about 13 percent of a 3 Mb Ethernet and 4 percent of a 10 Mb Ethernet. Measurements on the 10 Mb Ethernet indicate that for the packet size in question no significant network delays are to be expected for loads up to 25 percent [30]. Thus, one would expect minimal degradation with, say, two separate pairs of workstations communicating on the same network in this fashion. Unfortunately, our measurements of this scenario turned up a hardware problem in our 3 Mb Ethernet interface, which causes many collisions to go undetected and show up as corrupted packets. The response time for the 8 MHz processor workstation in this case is 3.4 milliseconds. The increase in time from 3.18 milliseconds is accounted for almost entirely by the timeouts and retransmissions arising from this hardware problem (roughly one retransmission per 2000 packets). With corrected network interfaces, we estimate that the network can support any reasonable level of message communication without significant performance degradation. Similar measurements could not be conducted on the 10 Mb network due to the limited number of connected machines at the time of writing.
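The load percentages quoted for a single communicating pair are straightforward to check. The sketch below also estimates how many such pairs would fit under the 25 percent threshold on the 10 Mb network, under the simplifying assumption that loads add linearly and collisions are ignored. The function names are hypothetical.

```python
PAIR_LOAD_BPS = 400_000  # load of one pair at maximum speed, from the text


def utilization_percent(load_bps, network_bps):
    """Share of the raw network data rate taken by the given load."""
    return 100.0 * load_bps / network_bps


def max_pairs(load_bps, network_bps, threshold_percent):
    """Largest number of pairs whose combined load stays at or below
    the threshold, assuming loads simply add up."""
    return (threshold_percent * network_bps) // (100 * load_bps)


if __name__ == "__main__":
    print(round(utilization_percent(PAIR_LOAD_BPS, 3_000_000)))
    print(round(utilization_percent(PAIR_LOAD_BPS, 10_000_000)))
    print(max_pairs(PAIR_LOAD_BPS, 10_000_000, 25))
```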

A more critical resource is processor time. This is especially true for machines such as servers that tend to be the focus of a significant amount of message traffic. For instance, just based on server processor time, a workstation is limited to at most about 575 message exchanges per second, independent of the number of clients. The number is substantially lower for file access operations, particularly when a realistic figure for file server processing is included. File access measurements are examined in the next section.
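One way to arrive at a limit of this size is to divide one second by the server processor time per exchange. The sketch below assumes that this time is the 3.04 msec processor total minus the 1.30 msec client share quoted earlier for the 10 Mb network and 10 MHz processor; that derivation is an assumption, not something the text states, and the function name is hypothetical.

```python
def max_exchanges_per_second(server_ms_per_exchange):
    """Upper limit on exchanges per second when the server processor
    is the only bottleneck."""
    return 1000.0 / server_ms_per_exchange


if __name__ == "__main__":
    server_ms = 3.04 - 1.30  # assumed server processor time per exchange
    print(round(max_exchanges_per_second(server_ms)))
```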


3.6.  File  Access  Using  the  V  Kernel 

Although it is attractive to consider the kernel as simply providing message communication, the predominant use of message communication is to provide file access, especially in our environment of diskless personal workstations. File access takes place in several different forms: random file page access, sequential file access and program loading. In this chapter we restrict ourselves to random page-level access and program loading. Sequential file access is studied in detail in the next chapter. We assume that the file server is dedicated to serving the client process we are measuring and otherwise idle. We first describe the performance of random page-level file access.

3.6.1. Random Page-Level File Access

Tables 3-9 and 3-10 list the times for reading or writing a 512 byte block between two processes using the 10 MHz processor, interconnected by a 3 Mb or a 10 Mb Ethernet⁷. The columns are to be interpreted according to the explanation given for similarly labeled columns of Tables 3-5 to 3-8. The times do not include the time to fetch the data from disk but indicate expected performance when data is buffered in memory. A page read involves the sequence of kernel operations Send-Receive-Reply, whereby the page is appended to the Reply. A page write involves a Send-Receive-Reply, where again the page is appended, this time to the Send.

There are several considerations that compensate for the cost of remote operations being higher than local operations. (Some are special cases of those described for simple message exchanges.) First, the extra 2.9 millisecond time for remote operations is relatively small compared to the time cost of the file system operation itself. In particular, disk access time can be estimated at 20 milliseconds (assuming minimal seeking) and file system processor time at 2.5 milliseconds⁸. This gives a local file read time of 24.2 milliseconds and a remote time of 27 milliseconds, making the cost of the remote operation only 12 percent more than the local operation.


Random Page-Level Access

Kernel Operation   Elapsed Time               Network    Processor Time
Operation          Local   Remote   Diff.     Penalty    Client   Server

page read          1.69    5.56     3.87      3.89       2.50     3.28
page write         1.68    5.60     3.94      3.89       2.58     3.32

Table 3-9: Page-Level File Access: 512 byte pages (times in msec.)


page read          1.69    4.54     2.85      3.15       2.41     3.08
page write         1.68    4.54     2.86      3.15       2.38     3.10

Table 3-10: Page-Level File Access: 512 byte pages (times in msec.)

7. From this point, we present only the figures for the 10 MHz processor.

8. This is based on measurements of LOCUS [59] that give 6.2 and 4.3 milliseconds as processor time costs for 512-byte file read and write operations respectively on a PDP 11/45, which is roughly half the speed of the 10 MHz Motorola 68000 processor used in the Sun workstation.

This comparison assumes that a local file system workstation is the same speed as a dedicated file server. In reality, a shared file server is often faster because of the faster disks and more memory for disk caching that come with economy of scale. If the average disk access time for a file server is 2.8 milliseconds less than the average local disk access time (or better), there is no time penalty (and possibly some advantage) for remote file operations.
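The local and remote file read estimates can be reproduced from the figures already given: the 20 msec disk access estimate, the 2.5 msec file system processor time, and the local and remote page read times of Table 3-10. The sketch below is only a restatement of that arithmetic; the function name is hypothetical.

```python
DISK_MS = 20.0        # estimated disk access time, from the text
FILE_SYSTEM_MS = 2.5  # estimated file system processor time


def file_read_ms(page_read_ms):
    """Total time for one page read including disk and file system."""
    return DISK_MS + FILE_SYSTEM_MS + page_read_ms


if __name__ == "__main__":
    local = file_read_ms(1.69)   # local page read, Table 3-10
    remote = file_read_ms(4.54)  # remote page read, Table 3-10
    print(round(local, 1), round(remote, 1))
    print(round(100.0 * (remote - local) / local))  # extra cost in percent
    print(round(remote - local, 2))  # disk time saving that breaks even
```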

Second, remote file access offloads the workstation processor if the file system processing overhead per request is greater than the difference between the client processor time for remote page access and for local page access, namely 0.7 milliseconds. A processor cost of more than 0.7 milliseconds per request can be expected from the estimation made earlier using LOCUS figures.

These measurements indicate the performance when file reading and writing use explicit segment specification in the message and the kernel appends the segments appropriately. However, a file write can also be performed in a more basic Thoth-like way using the Send-Receive-MoveFrom-Reply sequence. For a 512 byte write on a 10 Mb network, this costs 7.0 milliseconds. Thus, the segment mechanism saves 2.5 milliseconds on every page read and write operation, justifying this extension to the message primitives. A caveat needs to be made about the benefit of being able to receive the segment at the same time as the message is being received. In the current implementation the segment gets dropped when the receiver is not waiting to receive a message at the point the network packet arrives. This is more likely to happen under conditions of increasing load, and could thus deteriorate the performance to the figure mentioned above for the Send-Receive-MoveFrom-Reply sequence. When demand paging becomes available, we hope to avoid this problem by always accepting the incoming segment and mapping it into the receiver's address space at the time the receiver performs the next Receive.
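The effect of dropped segments on the average write time can be illustrated with a simple expected-value model. This is a sketch under an assumption that goes beyond the text: a write whose segment is dropped is taken to cost exactly the 7.0 msec of the Send-Receive-MoveFrom-Reply sequence, and drops are treated as independent events with a hypothetical probability p_drop.

```python
def expected_write_ms(p_drop, appended_ms=4.54, fallback_ms=7.0):
    """Average 512 byte write time on the 10 Mb network when a
    fraction p_drop of the appended segments is dropped and has to
    be fetched with a separate MoveFrom."""
    if not 0.0 <= p_drop <= 1.0:
        raise ValueError("p_drop must lie between 0 and 1")
    return (1.0 - p_drop) * appended_ms + p_drop * fallback_ms


if __name__ == "__main__":
    for p in (0.0, 0.1, 0.5, 1.0):  # hypothetical drop probabilities
        print(p, round(expected_write_ms(p), 2))
```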

3.6.2.  Program  Loading 

Program loading differs as a file access activity from page-level access in that the entire file containing the program (or most of it) is to be transferred as quickly as possible into a waiting program execution space. For instance, a simple command interpreter we have written to run with the V kernel loads programs in two read operations: the first read accesses the program header information; the second read copies the program code and data into the newly created program space. The time for the first read is just the single block remote read time given earlier. The second read, generally consisting of several tens of disk pages, uses MoveTo to transfer the data. Because MoveTo requires that the data be stored contiguously in memory, it is often convenient to implement a large read as multiple MoveTo operations. For instance, our current Vax file server breaks large read and write operations into MoveTo and MoveFrom operations of at most 4 kilobytes at a time. Tables 3-11 and 3-12 give the time to transfer 64 kilobytes between processes. (The elapsed time for file writing is basically the same as for reading and has been omitted for the sake of brevity.) The transfer unit is the amount of data transferred per MoveTo operation in satisfying the read request.
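Breaking a large read into transfer units amounts to a simple chunking loop. The sketch below shows only the offset and length bookkeeping; it does not model the MoveTo primitive itself, and the function name is hypothetical.

```python
def split_read(total_bytes, transfer_unit):
    """Yield (offset, length) pairs that cover total_bytes using
    pieces of at most transfer_unit bytes each."""
    if transfer_unit <= 0:
        raise ValueError("transfer_unit must be positive")
    offset = 0
    while offset < total_bytes:
        length = min(transfer_unit, total_bytes - offset)
        yield offset, length
        offset += length


if __name__ == "__main__":
    # A 64 kilobyte read served in 4 kilobyte transfer units.
    pieces = list(split_read(64 * 1024, 4 * 1024))
    print(len(pieces), pieces[0], pieces[-1])
```

Each pair would correspond to one MoveTo operation issued by the server; a larger transfer unit means fewer operations for the same read.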

The times given for program loading on the 3 Mb Ethernet using a 16 or 64 kilobyte transfer unit correspond to a data rate of about 192 kilobytes per second, which is within 12 percent of the data rate we can achieve on a Sun workstation by simply writing packets to the network interface as rapidly as possible. Moreover, if the file server retained copies of frequently used programs in memory, much as many current timesharing systems do, program loading could achieve the performance given in the table, independent of disk speed. Thus, we argue that MoveTo and MoveFrom with large transfer units provide an efficient program loading mechanism that is almost as fast as can be achieved with the given hardware.
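The data rate follows from dividing the 64 kilobytes transferred by the remote elapsed time. The sketch below uses the 344.6 and 335.4 msec remote times listed for the 16 and 64 kilobyte transfer units on the 3 Mb network; the function name is hypothetical, and the results come out close to, rather than exactly at, the figure of about 192 kilobytes per second.

```python
def data_rate_kb_per_s(kilobytes, elapsed_ms):
    """Data rate in kilobytes per second for one transfer."""
    return kilobytes / (elapsed_ms / 1000.0)


if __name__ == "__main__":
    for unit_kb, elapsed_ms in ((16, 344.6), (64, 335.4)):
        print(unit_kb, round(data_rate_kb_per_s(64, elapsed_ms), 1))
```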

Program Loading - 3 Mb

Kernel Operation   Elapsed Time                    Network    Processor Time
Transfer unit      Local   Remote   Difference     Penalty    Client   Server

1 Kb               71.7    518.3    446.5          434.5      207.1    297.9
4 Kb               62.5    368.4    305.8          *          176.1    225.2
16 Kb              60.2    344.6    284.3          *          170.0    216.9
64 Kb              59.7    335.4    275.1          *          168.1    212.7

Table 3-11: 64 Kilobyte Remote Read (times in msec.)

* Not available from measurement.


Program Loading - 10 Mb

Kernel Operation   Elapsed Time                    Network    Processor Time
Transfer unit      Local   Remote   Difference     Penalty    Client   Server

1 Kb               71.7    376.1    304.4          328.4      175.8    246.6
4 Kb               62.5    241.1    178.6          *          147.8    181.4
16 Kb              60.2    206.7    146.5          *          140.6    163.9
64 Kb              59.7    198.0    138.3          *          138.9    159.5

Table 3-12: 64 Kilobyte Remote Read (times in msec.)


3.7. Pipes


3.7.1. Introduction

One of the principal claims of this thesis is the utility of a general interprocess communication mechanism as a communication substrate for a distributed system. The main advantage of interprocess communication mechanisms is their generality, allowing other protocols to be built on top in a convenient fashion. With special-purpose protocols, a new protocol has to be devised from the ground up for every new application. However, if proper care is not taken, the extra layer of protocol can cause significant performance degradation.

In Section 3.6 we refuted the notion that the use of interprocess communication as a base for file access results in unsatisfactory file access performance. We continue the discussion of file-access performance in Chapter 4. In order to fully substantiate our claim about the usefulness of interprocess communication, we still need to demonstrate that other protocols (other than file access) can be built conveniently and efficiently on top of our interprocess communication mechanism. We have chosen to illustrate this point by describing the implementation and the performance of pipe data transfer, when implemented on top of the V kernel primitives. From a communication standpoint, pipes are different from file access in that there exists a symmetrical relationship between the two communicating entities: the reader and the writer of a pipe have similar responsibilities and privileges. Between a user program and a file server, an asymmetrical, client-server relationship exists; the client sends requests and the server fulfills them. A pipe on the other hand is a form of client-to-client communication. The alert reader undoubtedly has noted that the asymmetric nature of the V message primitives, with distinct primitives for clients (Send) and servers (Receive, Reply and Forward), models very well the client-server style interaction between a file system and its clients. Client-to-client communication cannot be accomplished as directly with the V primitives: clients typically only execute Sends and never execute Receives¹⁰. One could go from there and argue that we have sufficiently "adjusted" our message primitives to the desiderata of file access, and in doing so, we have undone the benefits we hoped to achieve by using interprocess communication, namely that all applications could be built on top of the interprocess communication layer, in a convenient and efficient fashion. In this section we address this objection by presenting the implementation and the performance of pipes in the V-System. In particular, we show that the experimentally measured maximum data rate for V network pipes is similar to the estimated maximum data rate of a kernel-level implementation. For local pipes, some performance loss has to be tolerated, although small on an absolute scale.

Pipes are described as follows. In Section 3.7.2 we describe the semantics we expect of pipes. In Section 3.7.3 we describe the implementation of pipes in terms of the V primitives. The performance of pipes in this implementation is presented in Section 3.7.4. Next, in Section 3.7.5, we estimate the potential performance of pipes in a kernel implementation, with a specialized protocol for network pipes. In Section 3.7.6 we take a look at related efforts in the LOCUS and ^cn systems. Conclusions on the subject of pipes are drawn in Section 3.7.7.

10. Note that this problem is not specific to the V communication primitives, but is present in any interprocess communication system that espouses a client-server model. With remote procedure calls, for instance, servers provide procedure bodies and clients make procedure calls. There is no way for direct client-to-client communication because clients do not provide procedure bodies for invocation by other clients.

3.7.2.  Pipe  Semantics 

A pipe is a communications paradigm that provides for the buffered and synchronized transmission of
streams of data between processes. Next, we list the requirements a pipe implementation should satisfy.

Operations
    A process can read data from a pipe or write data to a pipe in blocks of a fixed maximum
    size. Additionally, there are operations for creating, opening and closing pipes.

Buffering
    A pipe should provide some amount of buffering between reader and writer. The buffered
    data should continue to be accessible to the reader after the writer has gone away.

Synchronization
    A process trying to read from a pipe that is "empty" (i.e. currently has no data blocks
    queued for reading) is suspended until more blocks become available for reading. Trying
    to write to a pipe that has used up all of its buffers causes the writer to block until some
    buffers are freed up by reads.

Pipe end movement
    It should be possible to change the reading and/or writing end of a pipe. It should also be
    possible to have multiple readers and writers to a single pipe. It is assumed that concurrent
    access is serialized in some convenient fashion, for instance by a token passing mechanism.

Maintaining ordering
    The order in which blocks are written should be preserved on the reading end, even if
    more than one process (on more than one machine) writes to the pipe.

Exception processing
    An end-of-file indication is returned to a process trying to read from an empty pipe whose
    writing end has disappeared. A write to a pipe whose reading end has gone away is
    reflected as successful to the writer. The data of such a write is discarded.

V pipes are very similar to Unix pipes [65]. They are a natural extension of the latter in a multi-machine
environment. Unlike in Unix, a writer is not explicitly notified (by a sigpipe signal) when the reader of a pipe
disappears.




3.7.3.  Implementation  of  Pipes  Using  V  Messages 

In a system based on specialized protocols, pipes are likely to be implemented as a separate protocol in the
kernel. In the V-System, pipes are implemented outside of the kernel, by a process called the pipe server. This
process uses the V communication primitives to communicate with its clients. Since the pipe server is
accessed through the V message primitives, its location is transparent to its clients (except maybe for
performance).

Once a pipe connection between two processes is established, the writer can execute a write on a pipe by
sending a message to the pipe server containing a write request. Since the maximum data block size (1024
bytes) allowed for a single pipe write fits within the size of a segment that can be appended to a message, both
the message and the data block are transmitted in a single operation (a single packet in the case of network
access). If the pipe server is waiting to receive, it receives the message and moves the data block into its
current buffer using the Receive operation. It then links the buffer into the list of buffers written but not yet
read for this pipe. If, however, the pipe server is not waiting for a message at the time the write request
arrives, the kernel queues the message but discards the data segment. After receiving the message, the pipe
server then uses the MoveFrom operation to get the data block. Having moved the block into its buffers, by
either of the above methods, the pipe server then normally sends a Reply to the writer, except when the pipe
has used up all the buffers it had been allocated. In this case, the Reply is delayed until a buffer is freed by a
read request. Thus, blocking the writer when the pipe is full is done by delaying the Reply to its last write
request.

Similarly, reading from a pipe is accomplished by sending a message with a read request to the pipe server.
If buffers are queued for this pipe, the server unlinks the first buffer out of the queue and transmits it to the
reader. Again, since the maximum pipe data block fits within the maximum size for appending segments to
messages, this can be done in a single operation by appending the block to the Reply message. If no buffers
are available to be read, the reader is blocked by delaying the Reply until new buffers are written.
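The delayed-Reply mechanism described above can be illustrated with a small model. This is only a sketch under stated assumptions, not the V pipe server itself: the class name `PipeServer`, its methods, and the `replies` log are hypothetical, a single pipe is modelled, and message passing is reduced to ordinary method calls. A `Reply` is represented by appending a `(client, payload)` pair to a list, and a blocked client is simply one whose pair has not been appended yet.

```python
from collections import deque


class PipeServer:
    """Toy single-pipe model of the Reply-delaying logic (hypothetical names)."""

    def __init__(self, max_buffers):
        self.max_buffers = max_buffers
        self.buffers = deque()          # blocks written but not yet read
        self.blocked_writers = deque()  # writers whose Reply is being withheld
        self.blocked_readers = deque()  # readers whose Reply is being withheld
        self.replies = []               # (client, payload) pairs, in the order sent

    def _reply(self, client, payload=None):
        self.replies.append((client, payload))

    def write(self, writer, block):
        if self.blocked_readers:
            # Simplification: hand the block straight to a waiting reader.
            self._reply(self.blocked_readers.popleft(), block)
            self._reply(writer)
            return
        self.buffers.append(block)
        if len(self.buffers) < self.max_buffers:
            self._reply(writer)                  # normal case: Reply at once
        else:
            self.blocked_writers.append(writer)  # pipe full: delay the Reply

    def read(self, reader):
        if not self.buffers:
            self.blocked_readers.append(reader)  # pipe empty: delay the Reply
            return
        self._reply(reader, self.buffers.popleft())  # block appended to the Reply
        if self.blocked_writers:
            self._reply(self.blocked_writers.popleft())  # a buffer was freed
```

With `max_buffers=2`, the second consecutive write receives no Reply until a read frees a buffer, which is exactly how the writer ends up blocked.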

So, under normal circumstances, two message exchanges (one for the write and one for the read) are
necessary to get a block from the writer to the reader. If the pipe server is not ready to receive when a write
request comes in, an extra MoveFrom is necessary. If all processes involved (the reader, the writer and the
pipe server) are on different machines, this leads to 4 packets in the former case, and 6 in the latter (assuming
no retransmissions). If the reader is located on the same machine as the pipe server, with the writer on a
different machine, 2 (4) packets are necessary. Finally, when the writer and the pipe server are colocated, with
the reader on a different machine, the exchange should require 2 packets in either case (see Figure 3-9).
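The packet counts above follow a simple rule, sketched below. The function name and its parameters are hypothetical. It is an assumption, inferred from the 4 versus 6 packet counts in the text, that a remote MoveFrom costs two packets (a request and the returned data); retransmissions are ignored, as in the text.

```python
def packets_per_block(reader_host, server_host, writer_host, server_waiting=True):
    """Count the network packets needed to move one block through a V pipe."""
    packets = 0
    if writer_host != server_host:
        # Remote write: Send (data appended) plus the Reply.
        packets += 2
        if not server_waiting:
            # The appended segment was discarded, so a remote MoveFrom is needed.
            packets += 2
    if reader_host != server_host:
        # Remote read: Send (read request) plus the Reply (data appended).
        packets += 2
    return packets
```

For example, `packets_per_block("A", "B", "C")` gives 4, and the same call with `server_waiting=False` gives 6.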
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3.7.4.  Performance  of  V  Pipes 

In this section we present the performance figures obtained experimentally for V pipes. Table 3-13 shows
measurement results for V pipes on 10 MHz SUN workstations connected by a 3 Mb Ethernet. In the table we
use the following notation: R for the reader process, W for the writer process and P for the pipe server. A
single dash separating two letters indicates the processes corresponding to the letters are on the same
machine; triple dashes indicate the two processes are on different machines. At the time of the
measurements, little other network traffic was present, so network contention should be all but absent from
our figures. Furthermore, the pipe server, the reader and the writer process were the only processes running
on the respective machines. All transfers were done at the maximum pipe block size of 1024 bytes.

We now analyze how the total cost of transferring a block through a pipe breaks down over the different
components in its implementation. The primary cost components are the two Send-Receive-Reply sequences,
with the appropriate data appended, plus the overhead in the pipe server. In the local case, the two
Send-Receive-Reply sequences add up to 4.25 milliseconds. The remaining 1.25 milliseconds must be
accounted for by buffering and other pipe server overhead. One of the design decisions made in the
implementation of the pipe server was to allocate data buffers dynamically rather than statically, in order to
reduce the static code size of the pipe server. Reversing this decision could probably reduce the buffering
overhead by some amount. However, it would increase the static code size of the pipe server, making it less
attractive to station a pipe server on every machine.

                 V Pipe Performance

Configuration      Elapsed Time      Data Rate

R-P-W                   5.5             180
R-P---W                22.0              45
R---P-W                10.1             100
R---P---W              18.2              55

Table 3-13: V Pipe Performance (times in msec., data rate in Kbytes per sec.)
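The data rate column follows directly from the elapsed time for one 1024-byte block. The sketch below recomputes it; the dictionary keys are descriptive labels chosen here for illustration, and the assignment of the 22.0 millisecond figure to the remote-writer configuration is inferred from the discussion in this section. The table's figures appear to be rounded, so the recomputed values agree only approximately.

```python
BLOCK_KBYTES = 1.0  # one pipe block of 1024 bytes

# Elapsed time per block in milliseconds, as reported in Table 3-13.
ELAPSED_MS = {
    "all_on_one_machine": 5.5,
    "writer_remote": 22.0,
    "reader_remote": 10.1,
    "all_on_different_machines": 18.2,
}


def data_rate_kbytes_per_sec(elapsed_ms, block_kbytes=BLOCK_KBYTES):
    """Data rate obtained when one block takes elapsed_ms milliseconds."""
    return block_kbytes / (elapsed_ms / 1000.0)


RATES = {cfg: data_rate_kbytes_per_sec(ms) for cfg, ms in ELAPSED_MS.items()}
```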

When the reader and the writer are on different machines, and the pipe server is located with the writer, the
elapsed time is 10.1 milliseconds. This data transfer is made up of a local message exchange with 1 kilobyte of
data (2.13 milliseconds), a remote message exchange with 1 kilobyte of data (9.5 milliseconds), and the pipe
server overhead (again estimated at 1.25 milliseconds). This adds up to 12.8 milliseconds, indicating some
amount of concurrency between the two machines. (The two machines are concurrently active during about 27
percent of the time it takes to do the transfer.)

When both the reader and the writer are on a different machine from the pipe server, performance drops
approximately by a factor of 2 vs. the case of the writer and the pipe server being on the same machine. This
is explained by the fact that now two remote message exchanges (with the appropriate segments appended)
have to transpire. Using the same argument as in the previous paragraph, this would amount to 20.2
milliseconds vs. an empirically observed value of 18 milliseconds, again indicating some amount of
concurrency.
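The concurrency figures quoted above can be reproduced from the summed component costs and the measured elapsed times. The sketch below assumes that "concurrency" means the overlap (summed cost minus elapsed time) expressed as a fraction of the elapsed time; that reading matches the 27 percent quoted in the text, but the function name and the definition are an interpretation, not taken from the source.

```python
def overlap_fraction(summed_ms, measured_ms):
    """Overlap between machines as a fraction of the measured elapsed time."""
    return (summed_ms - measured_ms) / measured_ms


# Pipe server colocated with the writer, reader remote (two machines).
TWO_MACHINE_OVERLAP = overlap_fraction(12.8, 10.1)

# Reader, pipe server and writer all on different machines.
THREE_MACHINE_OVERLAP = overlap_fraction(20.2, 18.2)
```

Under this definition the overlap is roughly 27 percent in the two-machine case and roughly 11 percent in the three-machine case.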

The performance is comparatively poor in the case of the writer being on a different machine from the
reader and the server. At first glance, there is no reason to expect the performance of this case to be different
from the performance measured for the case of the reader being on a different machine from the writer and
the pipe server. The poor performance is partially due to an artifact of the measurement and partially to the
current implementation of Receive. Remember that the segment appended to a Send gets discarded when the
receiver (in this case the pipe server) is not waiting to receive a message. In this case, the sequence of events is
as follows. The pipe server does a Receive. The first message from the writer arrives with the data block
appended to it. Since the pipe server was waiting for a message, the segment gets read in immediately. The
write request is replied to and the pipe server proceeds to transmit the packet to the reader using Reply.
However, in the meantime, the writer has received the Reply to its first write and sends off its second request.
Due to the particular timing of events in this experiment, this message arrives at the server before the latter is
done with the Reply and therefore the segment gets discarded. Now when the server gets ready to receive the
message, it has to do a MoveFrom before it gets the data block. This sequence of events repeats itself for
every block and causes performance inferior by about a factor of 2 to what one would be led to expect. In
practice, one would expect that some amount of processing between blocks by both the reader and the writer
would prevent this peculiar sequence of events from happening.

These results suggest the following strategy:

1.  Have  a  pipe  server  resident  on  every  machine. 

2.  Have  a  particular  pipe  managed  by  the  pipe  server  on  the  machine  on  which  the  writer  resides. 

The first part of this strategy is easily accomplished. Since the static code size of the pipe server is quite small
(11 Kbytes), this does not impose any heavy memory demands on the workstation. The pipe server can easily
be loaded as part of the system processes that are loaded when booting the workstation. The second part
presents somewhat of a problem and is not fully implemented right now. Typically, a pipe is created by an
executive, and then both ends of the pipe are passed on to processes created by the executive. Currently, the
executive always asks the pipe server on its machine to create the pipe and this pipe server manages the pipe
throughout its existence. If the writing end is passed to a process on a different machine, performance is
suboptimal. This strategy is adequate for our current environment. However, once more sophisticated
remote execution facilities become available, we might review this decision. This issue is a special case of the
more general problem of efficiently supporting multiple writers to a pipe on different machines. We consider
this problem next.

Consider the case of multiple writers to a pipe, some of which reside on different machines. In our current
implementation, all data passes through a single pipe server, located on the machine where the pipe was
created. At first glance, it would seem that this approach results, quite wastefully, in an extra "hop" for all
data not written from the pipe server's machine. It would seem advantageous to move the data directly from
the writer's machine to the reader's machine, without an intermediary. However, the situation is severely
complicated by the requirement (see Section 3.7.2) that the order in which blocks are written to the pipe is
preserved at the reading end. For instance, if the creator writes something into the pipe, then passes on the
writing end to a process on another machine, and then this process writes into the pipe, the order of writing
has to be preserved on the reading end. A single, centralized pipe server automatically accomplishes this
preservation of order. If multiple pipe servers are buffering data for a single pipe (with the objective of
having the pipe server on the same machine as where the data is produced), there must be some additional
protocol to ensure preservation of order. A possible approach would be to have a token passing mechanism,
whereby only the holder of the token would be allowed to write into the pipe. Before the token can be passed
and a new process is allowed to write into the pipe, the current pipe server must be informed that there will be
a new writer and the new pipe server must be determined. The current pipe server could either forward the
buffers it has queued for reading to the next pipe server, or it can keep the buffers but provide the reader with
the address of the new pipe server when all of its buffers have been read. If moves are relatively infrequent,
such a strategy results in better performance than the centralized pipe server. At present, however, we have
deemed this complication unnecessary. Note that buffering the data in the pipe server on the reading site
(even ignoring the performance difficulties discussed earlier with this approach) does not solve the problem
because of the possibility of the reading end being passed to a different machine.

3.7.5.  A  Kernel-Level  Implementation 

In this section we provide a reasonable estimate of the maximum data rate that could be accomplished
through a kernel-level implementation of pipes. The nature of the buffered communication mode associated
with pipes requires that at least two copies of the data be made: one from the writer to the intermediate buffer
and one from the buffer to the reader. Either one or both of these copies could be subsumed by unmapping
the virtual memory page from its original location and then mapping it into its final location. Alternatively,
one can imagine ridding oneself of one of the copies by allowing the writer to fill a buffer, specify its address
and size to the kernel, and then allowing it to proceed without a copy of the buffer being made. When the
reader then asks for this block of data, it can be copied from its original buffer to the reader's buffer without
an intermediate copy. This technique imposes on the writer the responsibility not to touch the corresponding
buffer until notification has been received that the reader has made its copy of the data. Part of this process
can be automated, again by using a virtual memory technique, called copy-on-write, whereby the kernel makes
a copy of the relevant virtual memory page(s) only when the writer attempts to write into the buffer [63]. In
order to be able to accurately compare a process-level and a kernel-level implementation, we assume that no
page map trickery is used and that a copy is effectively made. (Note that the implementation of the message
primitives does not rely on such trickery either.) Besides the cost of copying data, a number of other factors
have to be taken into account in order to come up with a reasonable estimate of the time it takes to transfer a
block of data through a pipe. These include the overhead of the system calls to execute both the read and the
write operations, process switching overhead and the cost of dynamically allocating buffers.

In the case of a pipe server-based implementation, the intermediate buffers reside in the pipe server. Data
is moved from the writer to the pipe server and from the pipe server to the reader by means of the
interprocess communication primitives. In a kernel-based implementation, the intermediate buffers reside in
the kernel(s). If the reader and the writer are on the same machine, a copy is made from the writer into the
kernel and then from the kernel to the reader. If the reader and the writer are on different machines, and the
buffers are kept in one or both of the kernels on those machines, a copy is made from the writer to its kernel,
from the writer's kernel to the reader's kernel and then from the reader's kernel to the reader. If the buffers
are in a different kernel, for instance because the writer has moved and the buffers remain at their initial
location, an extra copy is required into the intermediate kernel. In order to estimate the performance of a
kernel-based implementation, we first estimate the cost of a process-to-kernel copy and a kernel-to-kernel
copy. We estimate the cost of a process-to-kernel copy by the cost of an (intra-address space) MoveTo
operation of the same size. Indeed, the local MoveTo operation provides a means for moving data at a cost
only slightly higher than the "raw" copy cost (see [15] for a further discussion). The fact that the cost is slightly
higher stems from the overhead for access rights checking and for executing the MoveTo system call. Normally,
one would expect the Read or Write system calls in a kernel-based implementation of pipes to incur a similar
overhead. Thus, this estimate is quite accurate. The cost of a kernel-to-kernel copy is estimated by the cost of
a remote MoveTo: we have shown that remote MoveTo operations are a means for transferring data across the
network at speeds close to the network penalty, which is the best a kernel-to-kernel copy could achieve.
(Compare for instance the network penalty column to the elapsed time for a MoveTo in Tables 3-5 to 3-8.)
While there is a slight additional cost for executing the system call, this overhead is negligible compared to the
network penalty for the size of transfers we are considering (1 Kbyte).

Based on this argument, we estimate the cost for transferring a 1024 byte block through a local pipe to be
the sum of the following components:


copy from the writer to the kernel                    1.2 msec.
copy from the kernel to the reader                    1.2 msec.
buffering overhead in the kernel                      1.2 msec.
process switch between reader and writer              0.2 msec.


The buffering overhead in the kernel is assumed to be identical to the buffering overhead in the V pipe server.
The process switching time is that observed for the V-System. The resulting cost is thus 3.8 milliseconds, to be
compared to the empirically measured 5.5 milliseconds for V pipes. In V, a pipe data transfer is performed by
executing two Send-Receive-Reply sequences, one from the writer to the pipe server, and one from the pipe
server to the reader. This results in four copies rather than two: since the pipe server, the reader and the
writer are all in different address spaces, a copy must be made from the writer to the kernel's copy segment;
from there to the pipe server; from the pipe server to the kernel's copy segment; and from there to the reader.
These two extra copies, together with the extra process switch, result in an extra cost of 1.7 milliseconds.
Although significant on a relative scale (30 percent), the absolute value of the difference between both
implementations is quite small. We expect that with a maximum data rate of 180 kilobytes per second we are
able to satisfy the needs of most applications.
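The local-pipe estimate is a plain sum, which the following sketch reproduces. The component values and the measured 5.5 milliseconds are taken from the text; the variable names are chosen here for illustration only.

```python
# Estimated cost components (msec.) for a 1024-byte block through a local kernel pipe.
LOCAL_KERNEL_COMPONENTS_MS = {
    "copy writer -> kernel": 1.2,
    "copy kernel -> reader": 1.2,
    "buffering overhead in the kernel": 1.2,
    "process switch reader <-> writer": 0.2,
}

MEASURED_V_LOCAL_MS = 5.5  # measured V pipe, all processes on one machine

kernel_estimate_ms = sum(LOCAL_KERNEL_COMPONENTS_MS.values())
extra_cost_ms = MEASURED_V_LOCAL_MS - kernel_estimate_ms
relative_extra = extra_cost_ms / MEASURED_V_LOCAL_MS  # fraction of the V figure
```

The sum comes to 3.8 milliseconds and the difference to 1.7 milliseconds, which is about 31 percent of the measured V figure, in line with the roughly 30 percent quoted above.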

For the two-machine case (reader and writer on different machines), it is more difficult to make a
reasonable estimate because of potential concurrency effects. Temporarily ignoring such effects, we simply
sum the cost of the different components of a two-machine pipe transfer:


copy from the writer to the writer's kernel                   1.2 msec.
copy from the writer's kernel to the reader's kernel          8.0 msec.
copy from the reader's kernel to the reader                   1.2 msec.
buffering overhead in the writer's kernel                     1.2 msec.
buffering overhead in the reader's kernel                     1.2 msec.


This adds up to a total of 12.8 milliseconds. Assuming an identical amount of concurrency as empirically
observed for V pipes (27 percent) would result in an estimate of 10.1 milliseconds, identical to the value
experimentally measured for the V-System. This surprising result is the consequence of the following factors.
In the V implementation, data is buffered in the pipe server and only there. In the kernel-based
implementation, data is buffered in both kernels. As a result, the buffering overhead is incurred twice and an
extra copy is needed on the reader's machine, whereas in the V-System, data is copied immediately from the
network interface into the reader's address space. These disadvantages of a kernel-based implementation are
compensated by extra overhead in the V implementation resulting from the fact that in the latter data has to
be copied from the writer to the pipe server (rather than from the writer to the kernel), the context switch
between the writer and the pipe server, and finally the fact that the reader has to explicitly ask for the data
across the network, whereas in the kernel-based implementation data is assumed to migrate to the reader's
machine automatically.

In the case of multi-machine pipes, higher data rates could be achieved if one were to allow larger block
sizes (spanning potentially multiple packets), and if one were to stream these packets across the network
without intervening acknowledgements. In V, the same streaming could be accomplished by using MoveTo
and MoveFrom with large block sizes, rather than appending segments to the Send and Reply messages.

Finally, we wish to consider again the case of multiple writers. A kernel-level implementation of pipes must
face the same issue as a pipe server-based implementation, namely the requirement that the order of blocks is
preserved. The approaches suggested at the end of Section 3.7.4 remain valid: either all data passes through a
single kernel and gets serialized there, or some protocol is executed between the different kernels to guarantee
the appropriate ordering. As a final point in this comparison, we wish to compare the experimental figures
from the V-System for the three machine case to an estimate for a kernel-based implementation for that case.
Again, we temporarily ignore possible concurrency effects and sum the cost of the different cost components:


copy from the writer to the writer's kernel                      1.2 msec.
copy from the writer's kernel to the intermediate kernel         8.0 msec.
copy from the intermediate kernel to the reader's kernel         8.0 msec.
copy from the reader's kernel to the reader                      1.2 msec.
buffering overhead in the writer's kernel                        1.2 msec.
buffering overhead in the intermediate kernel                    1.2 msec.
buffering overhead in the reader's kernel                        1.2 msec.


This leads to a total of 22.0 milliseconds and, again assuming an identical amount of concurrency as that
observed experimentally in the V-System, to an elapsed time of 19.8 milliseconds. This value is slightly higher
than that observed for the V-System (18.2 milliseconds). The explanation of this result is similar to the
argument made for the two machine case. The kernel implementation pays a penalty by incurring the
buffering overhead three times instead of once and by having to make an extra copy on the reader's machine.
Its advantage over the V implementation results from the reader not having to go across the network to ask for
the data.
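The two- and three-machine kernel estimates can be recomputed as follows. This is a sketch under one stated assumption: the "identical amount of concurrency" is applied by dividing the summed cost by one plus the overlap fraction observed for V pipes, where that fraction is the summed V cost minus the measured time, divided by the measured time. This interpretation reproduces the 10.1 and 19.8 millisecond figures quoted in the text, but it is not spelled out in the source. All names are illustrative.

```python
COPY_LOCAL_MS = 1.2   # process <-> kernel copy (local MoveTo cost)
COPY_REMOTE_MS = 8.0  # kernel <-> kernel copy (remote MoveTo cost)
BUFFERING_MS = 1.2    # buffering overhead, incurred once per kernel


def summed_cost_ms(kernels):
    """Sum of copy and buffering costs when `kernels` kernels are on the path."""
    return 2 * COPY_LOCAL_MS + (kernels - 1) * COPY_REMOTE_MS + kernels * BUFFERING_MS


def with_concurrency(summed_ms, v_summed_ms, v_measured_ms):
    """Scale a summed cost by the overlap observed for V pipes (assumed model)."""
    overlap = (v_summed_ms - v_measured_ms) / v_measured_ms
    return summed_ms / (1.0 + overlap)


two_machine_ms = with_concurrency(summed_cost_ms(2), 12.8, 10.1)
three_machine_ms = with_concurrency(summed_cost_ms(3), 20.2, 18.2)
```

With these inputs the summed costs are 12.8 and 22.0 milliseconds, and the scaled estimates come out at about 10.1 and 19.8 milliseconds.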
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3.7.6.  Pipe  Implementations  in  Other  Systems 

In systems based on problem-oriented protocols, one would expect pipes to be implemented as a separate
protocol in the kernel. This is the case, for instance, in the LOCUS operating system. As in V, three sites can
potentially enter into piped data transfer between two processes: the writer site, the storage site and the reader
site. The logical flow of data is from the writer site to the storage site and from there on to the reader site. In
first order approximation, one can think of the storage site as the equivalent of the pipe server in V. However,
LOCUS uses a different buffering strategy. Unlike in V, where data is buffered in the pipe server process (and
only there), all of the LOCUS kernels involved potentially provide intermediate buffering. The kernel on each
of the three (logical) sites maintains its own view of the pipe, in particular of the pipe's current size and the
read and write pointer. For instance on the reading site, the reader process reads from the pipe until the
buffers in its kernel are empty. At that point, it is put to sleep and the kernel on that machine contacts the
kernel on the storage site for more data. The storage site then returns any available data, and the pipe size in
the storage site and in the reading site get updated accordingly. Similarly, on the writing site the writer
process proceeds until its buffers are full, at which point its kernel contacts the storage site kernel. Data may
also migrate asynchronously from the writing site to the storage site, and from the storage site to the reading
site. Special-case code in the kernel provides the obvious optimizations when two or more logical sites
coincide with a single physical site. From the available documentation, it would appear that the storage site
for a particular pipe always remains at its initial location and that no attempt is made to bypass the storage site
if both the reader and the writer are on a machine different from the one functioning as the storage site.

In the Eden system, a pipe-like interconnection mechanism, called transput, is provided. Eden objects
themselves consist of several concurrent processes. The Eden transput mechanism takes advantage of this by
having the data written by the writer object and not yet read by the reader object buffered by a separate
process within the writer object. Thus, when a write is done to a pipe, the data is copied from the process
within the writing object that performed the write to the buffering process. When a read comes in later, that
request is directed to the buffering process, which tries to satisfy it out of its buffers. Transput is thus
accomplished without any kernel support other than the normal Eden invocation primitives. The motivation
for this particular approach is to reduce the number of invocations per block transferred from two to one plus
a much cheaper intra-object procedure call. This is particularly important in Eden because the invocation
mechanism is rather heavy-handed and correspondingly expensive.

This optimization is made at the expense of a number of important concessions on the semantics of pipes.
First, it seems that data disappears as soon as the writer disappears, contrary to one of the goals we set forth in
Section 3.7.2. Second, the fact that data is buffered within one of the communicating entities removes a level
of indirection between the reader and the writer. This level of indirection is quite useful when the writer
changes over the pipe's lifetime. In V for instance, when the writing end changes, this does not affect the
reader: it just continues to send messages to the pipe server. In Eden, it is necessary for the reader to be
connected to the writer directly. When the writer changes, this connection has to be explicitly changed,
presumably by means of an additional primitive in the reader. The multiple writer case is even more
problematic. In this case, a pipe server-like Eden object must be interposed between the different writers and
the reader.

As a final note on this subject, we wish to point out that the V-System allows a similar multi-process
structure as the Eden system, with multiple processes forming a single team and residing in a single address
space. The library function for Write could create a process that would perform the same function as the
buffer process in the Eden writer object. For a local pipe transfer, the cost would then equal the sum of a
local (intra-address space) MoveTo, a local Send-Receive-Reply with the appropriate data appended, and the
buffering overhead, resulting in a total of 4.6 milliseconds, or 16 percent better than the measured V
performance. For the two machine case, the local Send-Receive-Reply has to be replaced in the calculation by
a remote one, resulting in a total of 11.9 milliseconds. Assuming again an identical amount of concurrency as
empirically observed, the elapsed time would be 9.4 milliseconds or 7 percent better than the measured V
performance. We deem these gains insignificant compared to the concessions that must be made on the
semantics of pipes in order to accomplish them.

3.7.7. Conclusion

In this section we have compared the experimentally measured performance of pipes in the V-System to the
estimated performance of a kernel-based implementation. In V, pipes are implemented using the interprocess
communication facilities. There are no special kernel primitives for pipes, nor is there a special protocol
to support network pipes. In a kernel-based implementation, there are evidently read and write kernel
primitives, plus a protocol to implement them across machines.

The goal of this section was to make an assessment of the cost of layering pipes on top of the V interprocess
communication, in comparison to a kernel-based implementation. We have shown that the maximum data rate
through V network pipes compares very well with the estimated maximum data rate of a kernel-based
implementation, because extra buffering in the kernel-based implementation compensates for the additional
overhead of the V message passing primitives. For local pipes, some penalty has to be paid because of extra
copies necessary in the V-System. While percentage-wise the penalty might seem high (30 percent), the
difference in absolute terms is rather small (1.7 milliseconds per kilobyte).

We have also discussed the problems associated with the ends of pipes moving away from their original
machine. So far we have found the simple approach of a single, centralized pipe server satisfactory, although
this assessment might change as more sophisticated remote program execution facilities become available.
We have shown that this case could be handled efficiently with appropriate modifications to the pipe server.

Finally, we have presented details of pipe implementations in different systems. For the Eden system in
particular, this has allowed us to demonstrate the cost of certain aspects of the definition of pipes. When
the definition is somewhat relaxed, as in Eden, certain performance benefits can be obtained.


3.8.  Chapter  Summary 

In this chapter, we have presented the V interprocess communication primitives and their performance. We
have also shown the performance of two applications, files and pipes, which are from a communications
viewpoint significantly different. By comparing the measured performance of these applications to the
optimally achievable performance in the given hardware environment, we have shown that the penalty for
implementing these applications on top of the V interprocess communication primitives is small and that,
therefore, the benefits to be expected from specialized protocols for these applications are minimal.

The measurements in this chapter are taken in an otherwise unloaded environment and reflect performance
in a particular hardware environment. In the next chapter, we use these measurements as inputs to a
queueing network model of network page-level file access. This analysis leads to an understanding of the
performance in different hardware environments and under more reasonable load conditions.

4.1. Introduction

In the previous chapter we have concentrated on measurements of V interprocess communication in a
particular hardware environment and in an otherwise idle system. While interesting in their own right, these
measurements convey in themselves little information as to how users would perceive the overall performance
of the V-System in a more realistic environment. First, users perceive the efficiency of a system through
the performance of the applications they run, and not through the cost of individual low-level communication
primitives. In this chapter, we consider sequential file access as our application of interest, because of
its foremost importance in determining overall system performance. We compare remote vs. local sequential
file access, a topic which is of paramount importance in our environment of diskless workstations. Second,
in a realistic environment, several (diskless) workstations are competing for the services of the file
server. This introduces queueing delays at the network, the file server and the disk, which enter into the
user's perception of the performance of the system. Third, we would like to predict how system performance
changes as some design parameters are changed. Among the parameters we consider are both hardware-related
measures such as network data rate, processor speed and disk characteristics as well as software-related
parameters such as the size of the interaction between clients and the file server, and various buffering
and caching schemes.

In order to address these questions, we build a queueing network model of the system under consideration.
This allows us to predict performance under different loads as well as in different environments, where
measurements would be difficult or impossible. Starting from measurements on the V-System, we first
compare the performance of local vs. remote sequential file access when only a single client accesses the
file server. These measurements are then used as inputs to a queueing network model, allowing us to predict
the performance degradation of remote sequential file access under load. In order to assess the benefits of
a particular modification to the baseline system, we extrapolate from the experimental measurements what the
results would look like for the modified system. By feeding suitably modified inputs into the queueing
network model, we can predict the performance improvements such a modification would yield.

The outline of this chapter is as follows. In Section 4.2 we present the canonical system under
consideration and its representation as a queueing network model. Section 4.3 contains measurement results
used as input data to the model and the evaluation of the model in this baseline configuration. Next, in
Section 4.4, we examine the effects of various modifications to the baseline model. Finally, in Section 4.5,
we summarize the basic conclusions from this modeling study. A similar study based on Unix 4.2BSD [52], with
some comparison to V and Apollo Domain [44], appears in [42].


4.2.  The  Canonical  System 
4.2.1. The System and its Model

The canonical system under consideration consists of a number of diskless workstations accessing a file
server over a local area network (see Figure 4-1). The workstations use the file server for all permanent
storage. We do not consider the effects of paging in this model. It is assumed that either the paging rates
are very modest or that a local disk is available for paging.

Figure 4-2 shows the queueing network model used to represent the system. We use the terminology of [43]
to describe the different components. Resources are represented by service centers, with a specific service
demand per request. Clients or customers offer a certain workload to the service centers. We assume that at
any given moment there is a fixed number N of workstations, resulting in a closed queueing network model,
in which clients generate requests separated in time by intervals of average length Z (the think time). The
model includes one token per request. (Workstations can only have a single request outstanding at any given
point in time.) The token cycles through the network, accumulating service and encountering queueing
delays as it visits the various service centers. We are interested in obtaining from the model the response
time: this is the average round trip time of the token around the network, from the moment the corresponding
request is initiated until its completion.

Figure 4-1: The Canonical System

In the model the file server is represented by two service centers: the file server CPU and the disk. Each
workstation is represented by a single service center, namely its CPU. We assume that there is no contention
and therefore no queueing delay at the client workstation CPUs. The network completes the list of service
centers, interconnected in the topology shown in Figure 4-2. In order to fully specify the model, we need to
characterize the workload generated by customers and the service demands per request at the various service
centers. This is discussed next.

4.2.2.  Customer  Characterization 

First, we need to select a definition of a customer request. We take a system-oriented view of a request,
namely we let a request correspond to doing 8 kilobytes of I/O.* Alternatively, we could have taken a more
user-oriented definition, whereby a (user) command would be taken as the basic request unit. Both
approaches are quite possible (see for instance [29] for an example of the first approach), and using our
solution techniques, both would yield the same result. We have somewhat arbitrarily chosen the
system-oriented view because of the ease with which service demands can be measured for this type of
request. The customer workload is then described by the average rate at which such requests are generated by
a typical workstation user. Evidently, the request rate is strongly dependent on the kind of applications the
system supports. We are interested in environments where users spend most of their time doing software
development-related work, as in a typical Unix environment. Measurements done at the University of
Washington on a comparable diskless workstation-based system indicate that workstation users generate on
average approximately 8 kilobytes of I/O every two seconds [42]. Thus, we introduce a think time of 2
seconds in between customer requests.

* Note that we do not necessarily assume that I/O is done in units of 8 kilobytes.

Figure 4-2: The Queueing Network Model Representing the Canonical System

4.2.3.  Service  Demands 

Service demands per request have to be computed or measured for the network, the file server CPU, the
disk and the client CPU. Additionally, we have measured the elapsed time for both local and remote
operations. We work with the figures for reading only: the service demands for writing are not significantly
different and the number of reads typically far exceeds the number of writes. The actual figures for the
baseline model are reported in Table 4-1 in Section 4.3. We briefly describe here the experiments performed
and the assumptions made in arriving at these service demands.

The service demands are generated by conducting a set of experiments that repeatedly transfer 8 kilobytes
of data between the client and the file server. Service demands (i.e. processor time) on the client
workstation and on the file server are measured by running a low-priority "busywork" process that repeatedly
updates a counter in an infinite loop (see Section 3.4). All other processor utilization reduces the
processor allocation to this process. Thus, the processor time used per operation is the total elapsed time
minus the processor time allocated to the "busywork" process, divided by the number of operations executed.
This method of measuring processor time accurately accounts for interrupt-level processing of network and
disk I/O as well as processor degradation due to bus interference from DMA operations. Network service
demand is computed by summing the length of the packets involved in the I/O operation and dividing that sum
by the network data rate. Disk service demand has three components: transfer time, rotational latency and
seek time. Transfer time and average rotational latency are taken from the manufacturer's device
specifications. In order to compute the average seek time, we measure the average seek distance (in
cylinders). The average seek time (in milliseconds) is then derived by taking the corresponding seek time
from the manufacturer's specifications.

In order to compute the service demand for the disk as well as in order to measure the elapsed time for the
overall disk, some assumption needs to be made about the number of seeks that are going to be performed
relative to the number of I/Os. In the measured baseline case we assume that one seek is performed for every
I/O operation. In practice, we expect this to be an overestimation of the number of seeks. The V file system
is extent-based and goes to great length to ensure that logically contiguous file blocks are also physically
contiguous on the disk. The benefits of this strategy are moderated by the fact that in the kind of systems
we are interested in, a large number of files are rather small. A Unix-like hierarchical file system
typically fosters the use of many small files. Measurements of file sizes weighted by frequency of access at
the University of Washington indicate an average of approximately 11 kilobytes [42]. Similar experiments done
by Ousterhout at Berkeley yielded similar results [40]. Therefore, even if the file system were able to
allocate all logically contiguous blocks in a physically contiguous manner on the disk, we might still
expect to see a significant number of seeks.
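
The three ways of obtaining service demands described in this section can be sketched in a few lines of
Python. This is only an illustration: the function names, the `seeks_per_io` parameter and all numeric
inputs are hypothetical and are not taken from Table 4-1 or from any device specification.

```python
def cpu_demand_per_op(elapsed_s, busywork_cpu_s, n_ops):
    """Processor time per operation, measured with the "busywork" method:
    whatever the low-priority counter process did not get was used by the I/O."""
    return (elapsed_s - busywork_cpu_s) / n_ops


def network_demand(packet_lengths_bytes, data_rate_bits_per_s):
    """Sum of the packet lengths divided by the network data rate."""
    return sum(packet_lengths_bytes) * 8 / data_rate_bits_per_s


def disk_demand(transfer_s, rotational_latency_s, seek_s, seeks_per_io=1.0):
    """Transfer time + rotational latency + seek time.
    seeks_per_io=1.0 mirrors the baseline assumption of one seek per I/O."""
    return transfer_s + rotational_latency_s + seeks_per_io * seek_s


# Hypothetical inputs, chosen only to exercise the functions.
cpu = cpu_demand_per_op(elapsed_s=10.0, busywork_cpu_s=4.0, n_ops=100)
net = network_demand([1024] * 8, 10_000_000)   # payload only, headers ignored
disk = disk_demand(transfer_s=0.004, rotational_latency_s=0.008, seek_s=0.018)
print(cpu, net, disk)
```

Lowering `seeks_per_io` below 1.0 is one way to explore how a less pessimistic seek assumption would change
the disk service demand.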


4.2.4.  Solution  Technique 

In order to solve the model, we use an approximate single-class MVA (mean-value analysis) solution
technique [64, 67]. (For a good description of the application of MVA to queueing network problems, see also
[43].) This technique has the advantage of being relatively accurate as well as computationally inexpensive,
but it imposes some limitations on the class of systems that can be successfully modeled. Two characteristics
of our system defy proper formulation in the MVA framework, namely the load-dependent service demand on
the Ethernet and the asynchrony in typical file system access. We briefly describe the approximations we use
in order to account for these characteristics.

Asynchrony cannot be modeled directly in the MVA framework because this solution technique requires
that there be only a single token per request present in the network. This token travels around through the
network accumulating service at the different service centers. Thus, no request can be processed at two
service centers at the same time. If a given request is the only request present in the system, the response
time is equal to the sum of its service demands at the various service centers. Measurements of file access,
as presented in Section 4.3, indicate (as expected) the opposite: in Table 4-1 the sum of the service demands
is higher than the elapsed time, indicating some amount of parallelism. This is the result of a number of
factors, including concurrent execution on the workstation and the file server (as indicated by the
parallelism for interprocess communication in Tables 3-5 to 3-8), and file server CPU processing while the
disk I/O operation is in progress. Methods have been proposed for incorporating asynchrony into MVA solution
methods [32]. We have chosen not to use these methods because they add an extra class to the model, resulting
in a significant increase in the amount of computation necessary to arrive at a solution. Rather, we use the
following simple approximation: we subtract the amount of concurrency measured in the single-user case
from the predictions of the model under all load conditions. This is clearly correct for the single-user case
and approximately true at low loads. Furthermore, under high load conditions, the effect of concurrency is
negligible compared to the delays introduced by queueing.
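
The concurrency correction described above can be written compactly. The symbols are introduced here for
illustration and do not appear in the original: D_k is the service demand at service center k, T_1 the
measured single-user elapsed time, C the measured concurrency, and R_MVA(N) the response time predicted by
the model for N workstations.

```latex
C = \sum_k D_k - T_1, \qquad
R(N) \approx R_{\mathrm{MVA}}(N) - C, \qquad
R(1) = \sum_k D_k - C = T_1 .
```

For N = 1 the model predicts exactly the sum of the service demands, so the corrected value reproduces the
measurement. This is the sense in which the approximation is correct for the single-user case.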

The load-dependent characteristic of the Ethernet network is accounted for by a technique called
hierarchical modeling [43]. Here, a low-level analytic and simulation model is used to derive the service
demand of the Ethernet as a function of its offered load [4]. This low-level model is then integrated into
the overall queueing model of the system as a flow-equivalent (load-dependent) service center. An extension
to the approximate MVA solution method is used to take into account the load-dependent nature of this
service center [80].
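
To make the solution technique more concrete, the following Python sketch implements one common form of
approximate single-class MVA, the Schweitzer fixed-point iteration. Whether this matches the exact variant
used here is an assumption. The sketch treats every center as load-independent, so it leaves out both the
flow-equivalent Ethernet center and the concurrency correction. The function name, its parameters and the
service demands in the usage lines are hypothetical.

```python
def approximate_mva(queueing_demands, delay_demand, n_customers, think_time,
                    tol=1e-10, max_iter=100_000):
    """Schweitzer approximate MVA for a closed, single-class network.

    queueing_demands: service demand (seconds) at each queueing center
    delay_demand:     demand at centers without contention (e.g. client CPU)
    Returns (response_time, throughput, queue_lengths).
    """
    centers = len(queueing_demands)
    queue = [n_customers / centers] * centers      # initial guess
    scale = (n_customers - 1) / n_customers        # arriving customer sees (N-1)/N of the queue
    response = throughput = 0.0
    for _ in range(max_iter):
        residence = [d * (1.0 + scale * q)
                     for d, q in zip(queueing_demands, queue)]
        response = delay_demand + sum(residence)
        throughput = n_customers / (think_time + response)
        new_queue = [throughput * r for r in residence]   # Little's law per center
        converged = max(abs(a - b) for a, b in zip(new_queue, queue)) < tol
        queue = new_queue
        if converged:
            break
    return response, throughput, queue


# Hypothetical demands: file server CPU, disk, network; client CPU as a delay center.
demands = [0.05, 0.02, 0.005]
r1, x1, _ = approximate_mva(demands, 0.03, n_customers=1, think_time=2.0)
r30, x30, _ = approximate_mva(demands, 0.03, n_customers=30, think_time=2.0)
print(r1, r30, x30 * max(demands))   # last value: bottleneck utilization
```

With a single customer the iteration returns the plain sum of the service demands, which is the property the
concurrency approximation relies on.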

4.2.5.  Response  Time 

Given the topology of the network (as in Figure 4-2), the service demands at the various service centers and
the customer workload, the MVA solution technique generates (among other things) the average response time
for a request. By running the model with suitably modified service demands, we get the response time for
modified versions of the system. We argue that these figures by themselves do not form an adequate basis for
comparing the user's perception of the performance of different file service configurations. The user's
perception of system performance is based on the speed of his applications. Besides file I/O, applications
contain some amount of computing as well. A more adequate basis for comparison is therefore the sum of the
average I/O response time (for a given file service configuration) plus the average amount of "user mode"
computing corresponding to the amount of data in the I/O request. For instance, if applications contain on
average 10 percent I/O and 90 percent computing, a two-fold increase in the time necessary to perform an
I/O operation would look quite detrimental when viewed in isolation. However, when viewed in the context
of the overall application, only a 10 percent performance decrease would result. A number of experiments
were run in a comparable environment to obtain an estimate of the average amount of computing relative to a
given amount of I/O. They indicate an average amount of 212 milliseconds of user mode computing per 8
kilobytes of I/O [42].
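
The example in the preceding paragraph can be checked with a short calculation. The original application
time is normalized to 1, with an I/O share of 0.1 and a computing share of 0.9. The symbols t_io and t_cpu
are notation introduced here.

```latex
\frac{t_{\mathrm{cpu}} + 2\,t_{\mathrm{io}}}{t_{\mathrm{cpu}} + t_{\mathrm{io}}}
  = \frac{0.9 + 2 \times 0.1}{0.9 + 0.1} = 1.10
```

The application therefore takes 10 percent longer even though the I/O time has doubled.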


4.3.  Results  from  the  Baseline  Model 


4.3.1. Service Demands and Elapsed Times

In Table 4-1 we present the results of measurements of service demands and elapsed times for doing 8
kilobytes of I/O in the baseline configuration. This configuration consists of a client running on a SUN
workstation, a file server running on a SUN with a Fujitsu Eagle disk and a Xylogics disk controller, both
of them connected to a 10 Mb Ethernet network.


Baseline  Configuration 

4.3.2. Discussion

The numbers in Table 4-1 allow us to make the following comparisons between local and remote single-user
I/O performance.

1. The ratio of total service demands for remote vs. local file access is 1.78. This indicates that there is
a significant cost to accessing files over the network, when the service demands for file I/O are considered
in isolation (without regard to parallelism or user mode computing). This is true even though we make
a rather pessimistic assumption about the disk (one seek per I/O). A more optimistic assumption about
the seek vs. I/O ratio would further inflate the ratio of remote vs. local service demands.

2. Concurrency is more pronounced in the remote case than when the file server is local. The ratio of
remote vs. local elapsed times of 1.70 is therefore somewhat lower than the corresponding ratio of total
service demands.

3. When an average amount of user mode computing of 212 milliseconds per I/O operation is added to
the elapsed time for file access, the ratio between remote and local drops to a more acceptable value of
1.15.

As argued before, we believe the latter ratio is the correct figure of merit in comparing different file
service configurations. From the previous arguments, we conclude that when the file server is unloaded, and
the file server is identical to the workstation (with respect to CPU and disk speed), the performance penalty
for accessing a remote file server is approximately 15 percent. In the next section, we consider the evolution
of this penalty as more workstations are added to the file server.
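
The ratios quoted in the list above are linked by simple arithmetic. Table 4-1 is not reproduced in this
text, so the sketch below does not use its values. It only solves for the single-user local elapsed time
that would make the quoted ratios (1.70 and 1.15) and the 212 milliseconds of user mode computing mutually
consistent. The function name is hypothetical, and because the quoted ratios are rounded, the result is a
rough consistency check and not a measurement.

```python
def implied_local_elapsed_ms(compute_ms, elapsed_ratio, overall_ratio):
    """Solve (compute + elapsed_ratio * L) / (compute + L) = overall_ratio for L."""
    return compute_ms * (overall_ratio - 1.0) / (elapsed_ratio - overall_ratio)


local_ms = implied_local_elapsed_ms(212.0, 1.70, 1.15)
remote_ms = 1.70 * local_ms
overall = (212.0 + remote_ms) / (212.0 + local_ms)
print(round(local_ms, 1), round(remote_ms, 1), round(overall, 2))
```

Under these assumptions the implied elapsed times are on the order of 58 milliseconds locally and 98
milliseconds remotely, which shows how a 70 percent difference in I/O time shrinks to about 15 percent once
user mode computing is included.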

4.3.3.  Effects  of  Congestion 

Figure 4-3 shows the effects of congestion as more workstations are added to the system. This graph has the
average amount of user mode computing per I/O operation added to the elapsed time for file I/O. With 10
workstations the ratio of remote vs. local is 1.19, to be compared against 1.15 in the single-client case, a
very moderate increase. For 30 workstations the ratio has risen to 1.45 and starts to rise sharply
thereafter.

The results for the baseline model under single-user and multi-user access so far allow us to draw the
following conclusion. Assume we are willing to pay some performance penalty for the benefits of a shared
remote file server (lower cost per workstation, easier file sharing, etc.). Let us somewhat arbitrarily fix
an upper limit on this penalty at 20 percent above local file access times. In that case, remote file access
from a workstation to a shared file server (under the assumptions of the baseline model) has acceptable
performance as long as the number of workstations is kept below 12, when viewed in the context of the
average amount of user mode processing per I/O.

We now analyze the reasons behind the deterioration of the performance when more workstations are
added to the file server. In order to do this, we use the following two intuitive results from queueing
theory:

1. Under low load, response time is approximately equal to the sum of the service demands at the
individual service centers. Therefore, a decrease of the service demand at any service center results in a
corresponding decrease in response time.

2. Under high load, response time is primarily determined by the queueing delay in front of the service
center with the highest service demand (the so-called bottleneck service center). This queueing delay is
in turn primarily determined by the service demand at that service center. Thus the most noticeable
improvement in response time results from decreasing the service demand at the bottleneck service
center.

These results are easily explained. First, under low load, a request cycles around the network, accumulating
service at the service centers and only seldom encountering queueing delays. Clearly, the overall response
time is approximately equal to the sum of the service demands. Second, as the load on the individual service
centers increases, the queueing delays become more pronounced and at some point start to dominate the
overall response time. Consider the (extreme) case where the bottleneck service center has a service demand
much higher than all the other service centers, and assume the load is so high that this center is entirely
saturated (100 percent utilization). In this situation, requests spend almost all of their time in the queue
waiting for service at the bottleneck service center. The length of this queue is approximately equal to N,
the number of requests present in the system. As soon as a request has finished processing at the bottleneck
service center, it quickly advances through the other service centers and almost immediately joins the queue
at the bottleneck service center again. The average amount of time spent in this queue (per cycle) is equal
to the length of the queue seen on arrival (approximately equal to N) times the average service demand. Thus,
since this queueing delay forms the biggest part of the overall response time, it can be seen that a decrease
in service demand at the bottleneck service center results in an (approximately) N-fold decrease in overall
response time. This result holds for the case where one service center has a service demand per request
significantly higher than any other and under complete saturation of the network. Under less extreme
assumptions, it remains true that a decrease in service demand at the slowest service center has the most
beneficial effect on the response time under load: the effect is less than N-fold but still significant.

Figure 4-3: Multi-Client File Access: Response Time vs. Number of Clients
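
The saturation argument can be summarized in symbols. The notation is introduced here for illustration:
D_b is the service demand at the bottleneck service center, D_k the demand at center k, N the number of
workstations, Z the think time and R the response time. The first line restates the argument in the text.
The second line is the standard asymptotic bound for closed queueing networks, added here as a refinement
and not taken from the original.

```latex
\begin{align*}
R &\approx N\,D_b \quad\Longrightarrow\quad \Delta R \approx N\,\Delta D_b \\
R(N) &\ge \max\Bigl(\sum_k D_k,\; N\,D_b - Z\Bigr)
\end{align*}
```

The first term of the bound dominates under low load and the second under high load, which corresponds to
the two intuitive results listed above.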

As Table 4-1 shows, the bottleneck service center in the baseline configuration is the file server CPU.
Figure 4-4 confirms this observation: it graphs the utilization at the various service centers as a function
of the number of workstations. Clearly, the file server CPU approaches saturation as more workstations are
added. Also noteworthy in Figure 4-4 is that the utilization of the network is truly minimal (less than 15
percent until 50 workstations are present).

More important than the exact shape of this graph is the observation that the file server CPU rather than
the disk is the service center with the highest service demand. This is true in spite of the fact that we
made a pessimistic assumption about the seek vs. I/O ratio and in spite of the fact that the V interprocess
communication is very efficient. (Measurements of a Unix 4.2BSD system indicate a service demand on the file
server that is twice as high [42].) Therefore, it is reasonable to expect that for a file server
configuration with a CPU comparable to a 68000 and a disk comparable to the Fujitsu Eagle or faster, the
file server CPU is the bottleneck service center. In the latter part of this chapter we are mainly interested
in reducing the performance degradation under high load. Therefore, we investigate primarily those
modifications that reduce the utilization of the file server CPU.

Figure 4-4: Multi-Client File Access: Utilization of Various Resources

Finally, a point needs to be made about the validity of extrapolating the single-user measurements to
multi-user access. In particular, some factors seem to indicate that the service demand per request on the
file server CPU would be less under load conditions. This results from file server processes being
continually active, rather than periodically, so that they do not need to be blocked and unblocked. This
effect is difficult to quantify and somewhat dependent on the construction of the file server. Additionally,
increased contention for file server buffers could potentially increase the service demand per request. We
further ignore these effects and assume that the service demands are independent of the load on the system.


4.4.  Modifications  to  the  Baseline  Model 

In this section we are primarily interested in reducing the response time under conditions of high load. The
different file server configurations we investigate are primarily aimed at this goal, although we also indicate
their effect at low loads. The following modifications are considered in this section.

1.  Using  a  faster  file  server  processor. 

2.  Increasing  the  request  size  to  the  file  server  in  order  to  reduce  protocol  overhead. 

3. Disk caching on the file server.

4. Introducing some form of caching on the client workstation.

5.  Increasing  the  number  of  file  servers. 

4.4.1. Faster File Server CPU

Figure 4-5 shows the effects of a CPU twice as fast as the 68000, as predicted by the model. These results are
obtained by reducing the service demand on the file server CPU by half. The model shows significantly
improved performance for all load values, but especially at high load. It is not clear that such a speedup could
be achieved in practice. The protocol used to transfer large amounts of data (large with respect to the network
packet size) is a streaming protocol without any form of flow control (see Chapter 5). Specifically, the source
of the transfer sends out back-to-back maximum-size packets until all data has been transferred, without
waiting for an acknowledgement between packets. In order to achieve high performance, the error rate has to
be kept low. In particular, the destination (in this case the workstation) must be able to receive packets as fast
as the source (the file server) can transmit them. Otherwise, the workstation starts dropping packets, causing
the file server to retransmit, with overall performance degradation as a result. Clearly, when the speed of the
file server is increased significantly, and it is sending out back-to-back packets, either the processor on the
workstation has to be upgraded correspondingly or the network interface on the workstation must be capable
of buffering a larger number of packets. Such modifications to the workstation might undo the economic
arguments for diskless workstations. On the other hand, introducing flow control into the protocol most
likely would (at least partially) undo the performance benefits one hopes to acquire by using a faster file
server CPU.

Figure 4-5: Remote File Access: Effects of a Faster File Server CPU

4.4.2. Increasing the Request Size

Increasing the size of client-server interaction decreases the demand on the file server by reducing protocol
overhead as well as reducing the number of disk commands that need to be issued. However, similar
observations as in Section 4.4.1 argue against increasing the request size much beyond 8 kilobytes. Again, in
order to avoid buffer overflow and packet loss, it might be necessary to include more sophisticated network
interfaces in the client machine. Additionally, while the benefits of going from 1 kilobyte requests to 8
kilobyte requests are very significant, the relative merits decrease as the request size is further increased. For
instance, we measured an elapsed time of 410.80 milliseconds for reading 8 kilobytes using 1 kilobyte
requests. This dropped to 100.99 milliseconds when using 8 kilobyte requests. A further increase of the
request size to 64 kilobytes yielded only a modest improvement to 95.80 milliseconds. In terms of
communication overhead, Table 3-8 indicates little benefit in going much beyond 8 kilobytes for MoveTo
operations, both in terms of elapsed time as well as in terms of processor utilization. While in theory less
seeking would be necessary, thereby reducing demands on the disk, in the kind of systems we are looking at,
such a reduction of disk service demand would in practice not be accomplished. As mentioned before, in
these systems most files are quite small, and therefore seeks would remain necessary.
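
The diminishing returns can be made concrete by restating the three elapsed times quoted above as ratios. The short Python sketch below does only that; the names `elapsed_ms` and `speedup` are illustrative and not part of the original study.

```python
# Elapsed times (milliseconds) quoted in the text, keyed by the
# request size in kilobytes.
elapsed_ms = {1: 410.80, 8: 100.99, 64: 95.80}


def speedup(from_kb: int, to_kb: int) -> float:
    """Ratio of elapsed times when moving from one request size to another."""
    return elapsed_ms[from_kb] / elapsed_ms[to_kb]


if __name__ == "__main__":
    print(f"1 KB -> 8 KB : {speedup(1, 8):.2f}x")
    print(f"8 KB -> 64 KB: {speedup(8, 64):.2f}x")
```

Going from 1 kilobyte to 8 kilobyte requests reduces the elapsed time roughly fourfold, while the further step to 64 kilobytes gains only about 5 percent.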

4.4.3.  File  Server  Caching 

Figure 4-5 also shows the effect of caching the disk on the file server. The cache is assumed to have a 50
percent hit ratio. Such a cache reduces demand on the disk by avoiding disk accesses for buffered pages.
It also reduces service demand on the file server CPU by reducing the number of disk commands that have to
be issued. On the other hand, cache management overhead increases the service demand on the CPU. The
inputs for the model are generated as follows. Disk service demand is reduced by 50 percent. In order to
obtain the file server CPU demand, an experiment is run whereby all requested pages reside in the cache. The
processor service demand is measured for this experiment. The file server CPU demand used as input to the
model is the sum of the service demand for the baseline case (including issuing the disk command) and the
service demand for the above experiment, divided by two. Cache management overhead is ignored.

The model shows an improvement comparable to that obtained by using a file server twice as fast as the
68000. However, it does not pose the same problems as using a faster file server CPU. If cache management
overhead can be kept low, this modification is very appealing. For instance, the file server could keep
frequently used programs in memory, much as many contemporary timesharing systems do, thereby
achieving a high cache hit ratio. At the time of writing, not enough data could be collected in order to judge
whether a cache hit ratio of 50 percent is reasonable. Clearly, the hit ratio is strongly dependent on the
workload presented to the cache and the amount of memory that can be put to use as file server buffers.


4.4.4.  Using  a  Client  Cache 

In addition to the advantages of a cache on the file server, a cache of file pages on a workstation machine
reduces the load on the file server by eliminating the need for remote file access for pages that are found in
the cache. On the down side, maintaining a cache on the workstation introduces cache management and
cache consistency overhead. Part of this overhead is accounted for on the workstation, which is assumed not
to be a source of contention; its effects under load are therefore minor. The caching overhead remains visible,
though, in the low-load elapsed times. Some extra overhead would most likely also be incurred on the file
server. Figure 4-6 shows the improvement in response time for a 50 percent cache hit ratio. The inputs for
this model were half the measured file server CPU, disk and network service demand of the baseline model
and identical client CPU service demand. Again, it is not possible to judge from currently available data
whether the assumption of a 50 percent cache hit ratio is reasonable. Presumably, a workstation cache could
also be combined with a file server cache.

Figure 4-6: Remote File Access: Relieving File Server Load

4.4.5.  Adding  a  Second  File  Server 

In the previous section, we explored one way of decentralizing file access, namely by introducing caches of
file pages on the workstations. In this section, we investigate an alternative way of decentralizing file access by
considering the case of two file server machines. We assume that traffic is equally divided between the two
file servers. Client CPU and network service demand thus remain the same, while for each file server, the disk
and file server CPU demand of the baseline model are divided by two. Figure 4-6 shows the performance of
this configuration. Performance improvement is comparable to that achieved by introducing a client cache
with a 50 percent hit ratio. In practice, introducing a second file server would probably result in better
performance than client caches since we ignored cache consistency overhead, if an equal spread of traffic can
be achieved.
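
The way each modification of Section 4.4 turns into a change of model inputs can be sketched in a few lines of Python. The sketch assumes a closed, single-class queueing network solved by exact Mean Value Analysis, with the client CPU treated as a delay center because it is assumed not to be a source of contention. Both the solution technique and every number below (`BASELINE`, `CLIENT_DELAY`, `CPU_ALL_HITS`) are assumptions made for illustration; they are not the measured service demands used for Figures 4-5 and 4-6.

```python
from typing import Dict


def mva_response_time(demands: Dict[str, float], delay: float, clients: int) -> float:
    """Exact Mean Value Analysis for a closed, single-class network.

    demands: service demand per request at each queueing center (ms)
    delay:   demand at the delay center (client CPU, no contention)
    clients: number of workstations issuing requests
    Returns the mean elapsed time per request (ms), client delay included.
    """
    queue_len = {center: 0.0 for center in demands}
    cycle_time = delay + sum(demands.values())
    for n in range(1, clients + 1):
        # Arrival theorem: a request sees the queues of the (n-1)-client network.
        residence = {c: d * (1.0 + queue_len[c]) for c, d in demands.items()}
        cycle_time = delay + sum(residence.values())
        throughput = n / cycle_time
        queue_len = {c: throughput * r for c, r in residence.items()}
    return cycle_time


# Hypothetical service demands in milliseconds per request (NOT measured values).
BASELINE = {"server_cpu": 20.0, "disk": 12.0, "network": 3.0}
CLIENT_DELAY = 15.0   # hypothetical client CPU demand
CPU_ALL_HITS = 14.0   # hypothetical server CPU demand when every page is cached


def variants() -> Dict[str, Dict[str, float]]:
    """Model inputs for the baseline and for each modification in the text."""
    b = BASELINE
    return {
        "baseline": dict(b),
        "faster_cpu": {**b, "server_cpu": b["server_cpu"] / 2},
        "server_cache": {**b, "disk": b["disk"] / 2,
                         "server_cpu": (b["server_cpu"] + CPU_ALL_HITS) / 2},
        "client_cache": {c: d / 2 for c, d in b.items()},
        "two_servers": {"cpu_a": b["server_cpu"] / 2, "cpu_b": b["server_cpu"] / 2,
                        "disk_a": b["disk"] / 2, "disk_b": b["disk"] / 2,
                        "network": b["network"]},
    }


if __name__ == "__main__":
    for name, demands in variants().items():
        times = [mva_response_time(demands, CLIENT_DELAY, n) for n in (1, 10, 30)]
        print(name, [round(t, 1) for t in times])
```

With these made-up inputs the two-server variant matches the baseline for a single client, since the total demand per request is unchanged, and only pulls ahead once the file server CPU starts to queue.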


4.5.  Chapter  Summary 

We have constructed a queueing network model of remote sequential file access based on measurements of
the V file server. Given this model we have investigated the performance of remote file access under
increasing load on the file server. We have also explored how high-load performance is affected by various
modifications to the file service configuration. We have been able to draw the following conclusions from this
modeling study:

1. When viewed in the context of the normal amount of user mode processing, a shared remote file server
is quite practical for moderate numbers of workstations, when one is willing to tolerate some amount of
performance degradation in return for the benefits of a shared file server.

2. Under the conditions of the baseline configuration (a CPU comparable to the 68000 and a disk
comparable to or faster than a Fujitsu Eagle), we have shown that the file server CPU is the bottleneck service
center for remote file access. This is true in spite of the fact that we have made rather pessimistic
assumptions about the seek vs. I/O ratio and in spite of the fact that we are using communication
primitives that are very efficient in terms of processor utilization on the file server.

3.  Significant  improvements,  both  in  low  and  high  load  performance,  can  be  derived  from  using  large  size 
interactions  between  the  client  and  the  file  server.  For  a  number  of  reasons,  including  buffering 
limitations,  increased  probability  of  packet  loss  and  the  small  average  file  size,  the  rate  of  improvement 
drops  significantly  when  the  request  size  increases  above  8  kilobytes. 

4. Among the modifications of the baseline configuration that we evaluated, introducing a file server cache
and introducing a second file server seem most promising in terms of improving performance under
high load. They both drastically reduce file server congestion without introducing significant
disadvantages.

The queueing network model does not take into account a number of factors. In particular, it only provides
us with a first order approximation of the elapsed time distribution, not taking into account the effects of
bursty traffic patterns. Simulation and further experience with the file server are necessary to judge whether
the predictions of the model are reasonable approximations of reality.




—  5  — 

Protocol  and  Implementation  Experience 


5.1.  Introduction 

Having described the communication primitives of the V-System and having studied in detail their
performance for various applications, we now turn our attention to the protocol underlying the distributed
operation of the V-System. This discussion is divided in two major parts. In the first part we specify the rules
of the protocol to which participating machines must adhere if they want to be part of the operation of the
distributed V-System. These include rules about allocation of protocol addresses (process identifiers), process
location, packet format and valid packet sequences. Motivations for the particular design choices made are
outlined. This protocol has been implemented as part of the V kernel to allow workstations to participate in
the distributed V-System. Experience with this implementation effort forms the second part of this chapter.

Because of its original implementation in the V kernel, the protocol described in this chapter is frequently
referred to as the V interkernel protocol. However, the protocol is relatively independent of any particular
implementation: any piece of software or hardware implementing this protocol can provide access to the
services of a distributed V-System or part thereof. For instance, a process running on a different operating
system can implement the protocol and thereby have its clients and its services become part of the V-System.
We have currently one example of such a software package in existence: it is made up of a set of Unix
processes and makes Unix services available to workstation users.^13

Before starting the discussion of the protocol, we would like to make a couple of disclaimers about the
contents of this chapter. First, the description of the protocol and its implementation are slightly idealized
versions of the real protocol as it is currently operational. Second, we omit any discussion of other aspects of
the V kernel. We refer the interested reader to [7] and [15]. Finally, while for the purposes of discussion it is
nice to draw a sharp line between a protocol definition and its implementation, such a division often becomes
blurred in practice, as demands for performance require environment-specific optimizations to be made. The
experience with the V protocol in this respect is no different from any other.


5.2.  The  V  Protocol 

The discussion of the V protocol is divided in four parts: allocation of V protocol addresses, mapping from
V protocol addresses to underlying protocol addresses, packet format and valid packet sequences. We denote
by kernel the entity that provides the V protocol communication services on a particular machine. This is
typically an instance of the V kernel but it might also be some other software package implementing the same
protocol. We use the term process to denote any communicating entity in the system. Again, this is most
often a V process but not necessarily. The protocol addresses in the V protocol are referred to as process
identifiers. Next, we discuss the allocation of these process identifiers.


13. It only implements the server side of the protocol, thereby providing access to servers on Unix. It does not implement the client side.
Therefore, it is not possible for other Unix processes to become part of the V-System. This restriction is in no way inherent and could
easily be removed.

5.3. Process Naming

For the purposes of communicating with other processes, every process needs to have a process identifier
that is unique within the V domain to which it belongs. A V domain is a set of machines forming a single
instance of the V-System. The current implementation assumes that within each domain some form of highly
reliable broadcast communication is available. A typical V domain is a (broadcast) local area network.

For reasons of availability and efficiency, we want to allow for distributed generation of process identifiers,
while still guaranteeing uniqueness. First, in Section 5.3.1, we present a distributed algorithm for generating
unique identifiers -- not necessarily process identifiers -- in a broadcast domain. We analyze this algorithm in
Section 5.3.2. Given this analysis, it becomes clear that the proposed method is too costly for direct use in
generating process identifiers, given the high frequency of this operation. We then present and analyze a
modification of the original method which is better suited to our needs (Section 5.3.3).


5.3.1. Generating Unique Identifiers in a Broadcast Domain

In the method proposed here, the individual kernels cooperate to generate domain-wide unique^14
identifiers in the following way. Let the size of the unique identifier be n bits. Each kernel maintains a
record of the identifiers it has currently allocated. When a particular kernel wants to generate a new unique
identifier, it picks (at random) an n-bit string and broadcasts it over the domain. All kernels check their
records to see if they have that bitstring already in use as a unique identifier. If one of the kernels has that
bitstring currently allocated, it transmits a complaint to the requesting kernel. This kernel now tries again
with a different n-bit string and continues to do so until it does not receive a complaint within a certain
timeout interval after it has announced its choice of unique identifier.
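
The announcement-complaint loop can be sketched as a small single-process simulation. This is a simplification under stated assumptions: the network is lossless, the timeout is modeled as a synchronous check of every kernel's record, and the names `Kernel`, `complains_about` and `generate_unique_id` are hypothetical, not taken from the V kernel.

```python
import random
from typing import List, Set


class Kernel:
    """One machine in the domain, with its record of allocated identifiers."""

    def __init__(self) -> None:
        self.allocated: Set[int] = set()

    def complains_about(self, candidate: int) -> bool:
        """A kernel complains if it already has the announced bitstring in use."""
        return candidate in self.allocated


def generate_unique_id(requester: Kernel, domain: List[Kernel], n_bits: int,
                       rng: random.Random, max_tries: int = 1000) -> int:
    """Pick random n-bit strings until no kernel in the domain complains."""
    for _ in range(max_tries):
        candidate = rng.getrandbits(n_bits)
        # "Broadcast" the candidate; the requester checks its own record too.
        if not any(k.complains_about(candidate) for k in domain):
            requester.allocated.add(candidate)
            return candidate
    raise RuntimeError("no free identifier found; the name space may be exhausted")


if __name__ == "__main__":
    rng = random.Random(0)
    domain = [Kernel() for _ in range(4)]
    print([generate_unique_id(domain[i % 4], domain, 8, rng) for i in range(10)])
```

Because nothing is ever lost or delayed in this sketch, it never produces duplicates; the failure modes discussed next arise only once announcements or complaints can go missing.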

Clearly, the proposed method is not one hundred percent secure. Either the announcement of a new
identifier or a subsequent complaint might get lost by the network and as a result two identical unique
identifiers might coexist. One can reduce the probability of this happening by executing multiple runs of the
algorithm whereby a given identifier is retransmitted a number of times. Nevertheless, the failure
probability remains non-zero. Also, a complaint might be delayed beyond the timeout interval during which
the requesting kernel is waiting for complaints and therefore go unnoticed. Next, we analyze the probability of
such a failure occurring and the cost of the algorithm both in terms of running time and communication overhead.


5.3.2. Probability of Failure and Communication Overhead

5.3.2.1. Probability of Failure

We denote by $p_f$ the probability that two identical unique identifiers are generated by the algorithm. In
calculating this probability, we make the following assumptions:

1. Packet transmissions are statistically independent events. The probability of a packet not arriving at its
destination is identical for all packets and denoted by $p_n$.

2. The timeout interval is long enough so that no "late" complaints have to be considered.

The probability $p_f$ of an undetected identifier collision then becomes the product of the probability $p$ of a
collision of unique identifiers, and the probability of either the announcement or the complaint being lost by
the network. The probability $p$ is clearly equal to

$$p = \frac{m}{2^n}$$

where $m$ is the number of unique identifiers in existence and $n$ is the size (in bits) of the unique identifier.
The probability of either the announcement or the complaint being lost is

$$1 - (1 - p_n)^2$$

Thus,

$$p_f = \frac{m}{2^n} \times \left(1 - (1 - p_n)^2\right)$$

14. Uniqueness is defined here as uniqueness at a particular point in time. The issue of preventing identifiers from being recycled shortly
after they have become invalid is not addressed by this algorithm.

An estimate must be found for $p_n$ based on the characteristics of the underlying network and the probability
of buffer overflow in the network interfaces.^15 Experience with local area networks indicates that this value is
typically very low. If necessary, the probability of an undetected identifier collision can be reduced further by
retransmitting the announcement of the new identifier several times. If we retransmit $\alpha$ times, and we assume
all transmissions are independent, the resulting probability $p_f$ becomes

$$p_f = \left[\frac{m}{2^n} \times \left(1 - (1 - p_n)^2\right)\right]^{\alpha}$$
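
To get a feel for the magnitudes involved, the expressions for $p$, the loss term and $p_f$ can be evaluated directly. The function names below are illustrative, and the sample values in the usage line (64 identifiers, 16-bit identifiers, a packet loss probability of 0.001) are assumptions chosen for the example, not figures from the measurements.

```python
def collision_probability(m: int, n_bits: int) -> float:
    """p = m / 2^n: chance that a random n-bit string is already in use."""
    return m / 2 ** n_bits


def loss_probability(p_n: float) -> float:
    """1 - (1 - p_n)^2: the announcement or the complaint is lost."""
    return 1.0 - (1.0 - p_n) ** 2


def undetected_collision_probability(m: int, n_bits: int, p_n: float,
                                     runs: int = 1) -> float:
    """p_f for the given number of runs, following the formulas in the text."""
    return (collision_probability(m, n_bits) * loss_probability(p_n)) ** runs


if __name__ == "__main__":
    for runs in (1, 2):
        p_f = undetected_collision_probability(64, 16, 0.001, runs)
        print(f"runs={runs}: p_f = {p_f:.3e}")
```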


5.3.2.2. Running Time and Communication Overhead

We compute two measures to assess the cost of the algorithm: the expected running time and the expected
number of packet events caused by one execution of the algorithm. A packet event is defined as the
transmission or the reception of a packet. It is indicative of the amount of time the processors in the system
spend in executing the algorithm. We assume all other overhead is either negligible or else proportional to the
number of packet events.

Denoting by $p(i)$ the probability of success on the i-th try, by $T_t$ the timeout period to wait for complaints,
and by $T_r$ the round trip time for an announcement-complaint sequence, the expected running time $T$ for the
algorithm becomes

$$T = T_t \times p(1) + (T_t + T_r) \times p(2) + (T_t + 2T_r) \times p(3) + \ldots$$

Assuming no packets are lost, the probabilities $p(i+1)$ form a geometric distribution with parameter $p$

$$p(i+1) = p^i \times (1 - p)$$

with $p$ defined as above

$$p = \frac{m}{2^n}$$

If the probability of packet loss is taken into account, $p$ needs to be multiplied by a factor $(1 - p_n)^2$. Indeed,
in order to have a choice of identifier rejected, the identifier has to be in use already (with probability $p$) and
both the announcement and the complaint must arrive at their destination (with probability $(1 - p_n)^2$).

Summing the above series, we then get

$$T = T_t + T_r \times \frac{p}{1 - p}$$

It is clear that for a sufficiently sparse population of the identifier name space (specifically, when
$T_r \times p / (1 - p) \ll T_t$), the value of $T$ is dominated by the choice of the timeout interval $T_t$.

Similarly, denoting by $C$ the number of packet events per unique identifier, and by $K$ the number of
machines present, we obtain

$$C = (K + 1) \times p(1) + (2K + 4) \times p(2) + (3K + 7) \times p(3) + \ldots$$

and, after substitution for $p(i)$ and summation,

$$C = (K + 1) + (K + 3) \times \frac{p}{1 - p}$$
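
Both closed forms follow from the same geometric series, which can be checked numerically. The sketch below compares a truncated version of each series against its closed form; the function names and the sample values ($p = 0.25$, a 100 millisecond timeout, a 10 millisecond round trip, 10 machines) are assumptions made for the check.

```python
def expected_running_time(p: float, t_timeout: float, t_roundtrip: float) -> float:
    """Closed form T = T_t + T_r * p / (1 - p)."""
    return t_timeout + t_roundtrip * p / (1.0 - p)


def expected_packet_events(p: float, machines: int) -> float:
    """Closed form C = (K + 1) + (K + 3) * p / (1 - p)."""
    return (machines + 1) + (machines + 3) * p / (1.0 - p)


def series_expectation(first_try_cost: float, cost_per_retry: float,
                       p: float, terms: int = 200) -> float:
    """Truncated sum over tries; success on try i+1 has probability p^i (1 - p)."""
    return sum((first_try_cost + i * cost_per_retry) * p ** i * (1.0 - p)
               for i in range(terms))


if __name__ == "__main__":
    p, t_t, t_r, k = 0.25, 100.0, 10.0, 10
    print(expected_running_time(p, t_t, t_r), series_expectation(t_t, t_r, p))
    print(expected_packet_events(p, k), series_expectation(k + 1, k + 3, p))
```

Each retry adds one round trip to the running time and $K + 3$ packet events to the cost, which is why the two series share the same structure.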

We can also derive the expected time for a second run of the algorithm, i.e. when the announcement is
retransmitted in order to reduce $p_f$. Assuming the same identifier does not get allocated by another kernel
between the first and the second run, we get for the expected time for the second run

$$T_2 = T_t + p_f \times T_r \times \left(1 + \frac{p}{1 - p}\right)$$

This formula is arrived at by combining the following two observations. First, with probability $(1 - p_f)$ we
have allocated a unique identifier in the first run. In that case, there are no complaints during the second run
and after $T_t$ the process terminates. Second, with probability $p_f$ we have allocated a non-unique identifier in
the first run. In this case, the original algorithm repeats itself during the second run, except for the fact that
there is always at least one complaint (since we assume the identifier is not unique). The average number of
announcements is thus identical to that obtained in the first run of the algorithm plus one. Summing the
contributions of both cases with the appropriate probabilities, we get

$$T_2 = (1 - p_f) \times T_t + p_f \times \left(T_t + T_r \times \left(\frac{p}{1 - p} + 1\right)\right)$$

which reduces to

$$T_2 = T_t + p_f \times T_r \times \left(1 + \frac{p}{1 - p}\right)$$

A similar argument applies to the $\alpha$-th retransmission. Even more so than for the first run, the value of $T$
for subsequent runs is dominated by $T_t$. This is intuitively clear since the probability of a collision is
reduced by $p_f$ for each successive run.

15. One could extend the above formula by using different failure probabilities $p_{n,bc}$ and $p_{n,uc}$ for broadcast and unicast packets. The
probability of either the announcement or the complaint getting lost then becomes $1 - (1 - p_{n,bc}) \times (1 - p_{n,uc})$.

5.3.2.3. Alternative Methods

There are a number of modifications one could explore with respect to this basic algorithm. For instance,
one could have the different kernels generate random numbers but not do any form of identifier collision
detection. Clearly, the cost both in terms of elapsed time as well as in terms of the number of packet events
then reduces to zero. The probability of failure can be computed by observing that the probability of two
random numbers being equal, in a sample of $\alpha$ from a population of $\beta$, is approximately equal to

$$1 - \left(1 - \frac{\alpha}{\beta}\right)^{\alpha}$$

or, assuming $\alpha \ll \beta$ and using the binomial approximation,

$$\frac{\alpha^2}{\beta}$$

The probability of failure $p_f$ then becomes, substituting $m$ for $\alpha$ and $2^n$ for $\beta$,

$$p_f = \frac{m^2}{2^n}$$

This value has to be compared with the value derived for $p_f$ above for the announcement-complaint
algorithm. Whether it is appropriate for a given application has to be judged by substituting the appropriate
values of $m$ and $n$.

Alternatively, one could take advantage of the fact that in most cases the hardware provides a unique host
address per machine. The kernel could then use this hardware address as part of the unique identifiers
generated by that kernel. Of course, this can only be accomplished if the hardware address (or some unique
substring thereof) fits within the size of the process identifier. For instance, on a 10 Mb Ethernet network
with its 48-bit addresses, one would have to use process identifiers at least as large as 48 bits. The method
would be applicable (and has been used in that form in an earlier implementation of the kernel) on a 3 Mb
Ethernet network with 8-bit host addresses.

5.3.3.  A  Practical  Method  for  Generating  Process  Identifiers 

Using a unique hardware address as part of the process identifier would evidently be the method of choice
(virtually zero overhead, perfectly reliable) if it were not precluded by the desire to keep the size of the
process identifier small. The method of simply generating a unique identifier without any collision detection
leads to unacceptable uniqueness characteristics in our environment. Consider for instance a system with 64
hosts each with an average of 16 processes active; this would produce an error probability $p_f$ of approximately
$2.5 \times 10^{-4}$. This probability is unacceptably high, in particular if one realizes that it continues to exist over
time. The announcement-complaint method, as described above, is by itself also impractical for generating
process identifiers. Given the high frequency of process creation and its relative time-criticalness, the costs in
terms of elapsed time and packet events are clearly too high. For instance, the overhead for creating a process
-- other than generating its unique identifier -- is on the order of half a millisecond. On the other hand, the
elapsed time for generating a unique identifier by the above method is bounded from below by the length of the
interval $T_t$ during which the kernel waits for complaints. $T_t$ should be at least an order of magnitude bigger
than $T_r$ in order to reduce the probability of late complaints, making a value on the order of 100
milliseconds appropriate for $T_t$. Clearly, process creation with this method of unique identifier generation
would be very expensive.
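
The two figures in this argument can be reproduced with a few lines of arithmetic. The sketch assumes the 32-bit process identifier size used in practice and applies the approximation $m^2 / 2^n$ from Section 5.3.2.3; the function names are illustrative.

```python
def no_detection_failure_probability(m: int, n_bits: int) -> float:
    """Approximation m^2 / 2^n for identifiers drawn without collision detection."""
    return m ** 2 / 2 ** n_bits


def slowdown(timeout_ms: float, creation_overhead_ms: float) -> float:
    """How many times longer the timeout is than the rest of process creation."""
    return timeout_ms / creation_overhead_ms


if __name__ == "__main__":
    m = 64 * 16                      # 64 hosts with 16 active processes each
    p_f = no_detection_failure_probability(m, 32)   # assumes 32-bit identifiers
    print(f"m = {m}, p_f = {p_f:.2e}")
    print(f"timeout / creation overhead = {slowdown(100.0, 0.5):.0f}")
```

With 1024 identifiers in a 32-bit space the approximation gives $2^{-12}$, about $2.4 \times 10^{-4}$, in line with the value quoted above. A 100 millisecond timeout is about 200 times the half millisecond spent on the rest of process creation.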

We remedy this situation by using a traditional hierarchical technique whereby a domain-wide unique
process identifier is generated as a combination of a domain-wide unique identifier per machine and a
machine-wide unique identifier per process. The method of Section 5.3.1 is used to generate the domain-wide
unique logical host identifier. This domain-wide unique host identifier is then concatenated with a locally
unique identifier to produce a domain-wide unique process identifier. In general, generation of locally
unique identifiers can be done trivially in comparison to generating domain-wide unique identifiers (for
instance, by taking successive values of a counter). This method therefore has the advantage that the
procedure for producing a domain-wide unique identifier has to be invoked only once per machine. When
allocating a new process identifier on a particular machine, it is then sufficient to invoke the procedure for
allocating locally unique identifiers.
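
A minimal sketch of the hierarchical scheme follows. It assumes a 32-bit process identifier split into 16 bits for the logical host identifier and 16 bits for the local identifier; this split, the class name `ProcessIdAllocator` and the helper `split_pid` are hypothetical. The local counter simply wraps around, so the sketch does not address the recycling of identifiers that are still in use.

```python
from typing import Tuple


class ProcessIdAllocator:
    """Builds process identifiers as <logical host id | local counter>."""

    def __init__(self, logical_host_id: int, host_bits: int = 16,
                 total_bits: int = 32) -> None:
        if not 0 <= logical_host_id < (1 << host_bits):
            raise ValueError("logical host identifier does not fit in host_bits")
        self.logical_host_id = logical_host_id
        self.local_bits = total_bits - host_bits
        self._next_local = 0

    def allocate(self) -> int:
        """Purely local operation: no network traffic is needed."""
        local = self._next_local
        # Simplification: wrap around without checking for identifiers in use.
        self._next_local = (self._next_local + 1) % (1 << self.local_bits)
        return (self.logical_host_id << self.local_bits) | local


def split_pid(pid: int, host_bits: int = 16, total_bits: int = 32) -> Tuple[int, int]:
    """Recover (logical host id, local id) from a process identifier."""
    local_bits = total_bits - host_bits
    return pid >> local_bits, pid & ((1 << local_bits) - 1)


if __name__ == "__main__":
    allocator = ProcessIdAllocator(logical_host_id=0x00AB)
    print([hex(allocator.allocate()) for _ in range(3)])
```

Since the logical host identifier occupies the high-order bits, a receiver can recover the host part with a shift, as `split_pid` does.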

The analysis of Section 5.3.2 remains valid for the generation of unique logical host identifiers. Given the
confines of an n-bit identifier, the division of these n bits into h bits for the logical host identifier and n-h bits
for the locally unique identifier is a compromise between speedy and efficient logical host identifier generation
on one hand, and the desire to have a large space of locally unique identifiers on the other hand (to prevent
untimely recycling). In Figure 5-1 we have set out the probability $p$ of an identifier collision, the expected
time $T$ necessary for generating a unique identifier and its cost $C$ in packet events, in terms of h, the
number of bits allocated to the logical host identifier field. The expected number of logical hosts is used as a
parameter in these curves. From these curves it can clearly be seen that the cost of logical host identifier
generation closely approximates its asymptotic value for values of h such that $2^h$ is slightly bigger than the
expected number of logical hosts.


5.3.4.  Conclusions 

From the analysis in this section, we draw the following conclusions:

1. Process identifiers have to be unique over a V domain. Domain-wide unique identifier generation
procedures are too expensive to be invoked on every process identifier generation, given the frequency
and the time-criticalness of the latter operation. Therefore, we use a hierarchical technique whereby a
process identifier consists of a domain-wide unique logical host identifier concatenated with a locally
unique identifier.

2. Several techniques can be used to generate domain-wide unique logical host identifiers. Using a unique
address provided by the hardware would be the method of choice if it were not precluded by
considerations regarding the relative size of process identifiers and host addresses. Therefore, an
announcement-complaint algorithm can be used.

3. Single-run announcement-complaint algorithms have some probability $p_f$ of failing, the value of $p_f$
being dependent on the population of the identifier space and the error characteristics of the network.
The cost of a single-run algorithm is in practice dominated by the timeout interval $T_t$. The failure
probability can be reduced (by a factor of $p_f$ per run) by using a multiple-run algorithm, at the
expense of another $T_t$ in run time for each additional run.

4. For efficient logical host identifier generation, it is sufficient to allocate to the logical host identifier field
h bits, such that $2^h$ is slightly bigger than the number of logical hosts in existence.

Figure 5-1: Uniqueness Characteristics

In practice, we use 32-bit process identifiers. This is the natural word size of the machines currently used in
the V environment. The high-order 16 bits of the process identifier serve as the logical host identifier
subfield. During initialization, the kernel picks a random 16-bit number by looking at its machine clock^16. It
then executes a single run of the announcement-complaint algorithm, which results in satisfactory uniqueness
guarantees for our environment. The timeout period $T_t$ is set at 0.5 seconds.
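
The 32-bit layout just described can be illustrated with a short sketch. This is a minimal Python illustration under stated assumptions, not the kernel's own code: the function names `make_pid`, `logical_host_of` and `local_id_of` are hypothetical, and the low-order 16 bits are assumed to hold the locally unique identifier.

```python
LOGICAL_HOST_BITS = 16  # high-order bits: domain-wide unique logical host identifier
LOCAL_ID_BITS = 16      # low-order bits: locally unique identifier (assumed layout)
LOCAL_ID_MASK = (1 << LOCAL_ID_BITS) - 1


def make_pid(logical_host: int, local_id: int) -> int:
    """Concatenate a logical host identifier and a local identifier into a 32-bit pid."""
    if not 0 <= logical_host < (1 << LOGICAL_HOST_BITS):
        raise ValueError("logical host identifier out of range")
    if not 0 <= local_id <= LOCAL_ID_MASK:
        raise ValueError("local identifier out of range")
    return (logical_host << LOCAL_ID_BITS) | local_id


def logical_host_of(pid: int) -> int:
    """Extract the logical host identifier subfield (hypothetical helper)."""
    return pid >> LOCAL_ID_BITS


def local_id_of(pid: int) -> int:
    """Extract the locally unique identifier subfield (hypothetical helper)."""
    return pid & LOCAL_ID_MASK
```

Only the logical host identifier requires the domain-wide generation procedure; allocating a new process identifier on a machine amounts to picking a fresh value for the low-order field.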

5.4.  Process  Location 

In order to have a message sent from a process on one machine to a process on another machine, the kernel
on the sender's machine must format a packet and instruct the underlying protocol layer to transmit that
packet to the appropriate machine. To do so, it must translate the process identifier of the target process into
an address that is meaningful to the underlying protocol layer. It follows from our assumptions about a V
domain that the underlying protocol layer provides a broadcast address, that is, a special address by which all
kernels can be reached. We now discuss process location, the method by which a kernel translates a
process identifier into an underlying address. In the current implementation of the V-System, the V protocol
is implemented on top of a data link protocol and therefore the underlying protocol addresses are referred to
as host addresses, and the communicating entities as hosts. However, this is not essential to this discussion.

^16 A subtle point: the machine clock and not a software clock. This results in a much more random result after rebooting, as desired.


5.4.1. Process Location by Broadcasting

The simplest process location strategy would be to have none at all. Given that we assume that
some form of broadcast is available in each V domain, every packet could be broadcast. All kernels would see
all messages and filter out those destined for their processes. For our environment, we believe this approach
to be impractical, because it requires all kernels to process all messages (regardless of their destination). The
cost $C_m$, in terms of packet events, of sending a message would be equal to the number of kernels present, K, plus
one. (The "plus one" derives from the fact that the sending kernel listens to its own transmissions, resulting in
2 packet events on the sending kernel.)

$$C_m = K + 1$$

This would result in an unacceptably high load on the processors for protocol handling. This approach might
become practical if part or all of the (network) interprocess communication machinery could be offloaded to a
smart network interface processor, and if the broadcast traffic does not interfere with the operation of
machines in the broadcast domain that are not participating in the V-System^17. One would presumably still
have to worry about a number of issues, including greatly increased queue lengths at the network interface,
and, as a result, an increased probability of packet loss.

In order to reduce the overhead associated with network interprocess communication, the protocol provides
a more sophisticated location procedure which attempts to deliver most messages by point-to-point
communication, thereby reducing $C_m$ to 2 per message.

5.4.2. Efficient Process Location

Conceptually, process location can be accomplished by providing a global table, map, mapping process
identifiers into host addresses. For the time being, we assume strong semantics for the entries in this table,
that is, we assume that a process with process identifier pid can only exist at the host with host address
map[pid]^18. Clearly, for reasons of availability and efficiency, the idea of such a global table is impractical
from an implementation viewpoint. In order for kernels to be able to make process location decisions based
solely on local information, we wish to replicate the map table over all kernels. It is not necessary for correct
operation that each kernel maintains a full copy of the table. Indeed, if it does not possess an entry for a
particular process identifier, the kernel can resort to broadcasting messages directed to that process. However,
it is essential that when an entry is present for a process identifier, the corresponding host address is correct,
i.e. that the process actually exists at that host address (or does not exist at all). Otherwise, the kernel would
erroneously conclude that that process identifier is invalid.

Given that a particular kernel maintains only an incomplete view of the mapping table, there is some
probability q that the host address of a process can be derived from that view. On average, then, the packet
event cost for sending a message is

$$C_m = 2 \times q + (K + 1) \times (1 - q)$$

^17 This could potentially be accomplished by multicasting, that is, addressing packets to a subset of all machines in the broadcast
domain.

^18 Alternatively, the table entries could be hints.

The probability q is a function of the size of the view relative to the overall size of the name space for process
identifiers as well as of the strategy used to decide what mappings to store in the view. An LRU strategy seems
appropriate since one expects some locality of reference with respect to process identifiers.
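
To make the trade-off concrete, the average cost can be evaluated for a few values of q. The following Python sketch only restates the two cost expressions above; the function names and the example figures (20 kernels, a 90 percent hit probability) are illustrative assumptions and do not come from measurements.

```python
def broadcast_cost(num_kernels: int) -> int:
    """Packet events for one broadcast message: every kernel plus the sender's own echo."""
    return num_kernels + 1


def expected_send_cost(num_kernels: int, hit_probability: float) -> float:
    """Average packet events per message when a fraction q of lookups hit the local view."""
    if not 0.0 <= hit_probability <= 1.0:
        raise ValueError("hit probability must lie in [0, 1]")
    point_to_point = 2  # one packet event at the sender, one at the receiver
    return (point_to_point * hit_probability
            + broadcast_cost(num_kernels) * (1.0 - hit_probability))


if __name__ == "__main__":
    # Hypothetical domain of 20 kernels with a 90% hit probability.
    print(expected_send_cost(20, 0.9))
```

With q equal to one the cost falls to the point-to-point value of 2; with q equal to zero it reverts to the pure broadcast cost K + 1.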

The space requirement in the kernel for maintaining a view of size S, with process identifiers of size n and
host addresses of size a, is then

$$\text{space} = S \times (n + a)$$

In order to achieve a reasonably low average value of $C_m$, it might be necessary to maintain a relatively large
view of the map, and the corresponding space demands in the kernel might be significant. This situation can
be improved by observing that the number of different host addresses (machines) in a V domain is typically
small (with respect to the possible number of process identifiers, $2^n$). Therefore, it is attractive to divide up
the process identifier name space into a set of equivalence classes, with each equivalence class mapping to a
single host address. Then, only a single entry per equivalence class is necessary in the mapping table. The
equivalence class is encoded in a substring of the process identifier, from which the kernels deduce it.
All process identifiers with the same encoding substring then reside on the
same host (assuming they exist at all) and they are all represented by a single entry in the map, mapping that
substring into a host address. In this way, the size of the view a kernel has to maintain in order to achieve a
certain value of $C_m$ can be reduced significantly. The formula becomes

$$\text{space} = S \times (h + a)$$

where h is the size of the host encoding substring. The value of S can be chosen much smaller in this case
because

1. The size of the name space is much smaller ($2^h$ vs. $2^n$).

2. The locality of reference with respect to hosts is far more pronounced than the locality of reference with
respect to process identifiers, resulting in a more effective utilization of the table.

In practice, the 16-bit logical host identifier subfield of the process identifier serves as a host encoding
substring. The different kernels maintain a table mapping some subset of these logical host identifiers into
host addresses. New entries are entered in the table as new mappings are inferred from incoming packets^19.
Old entries are deleted on an LRU basis. If a message has to be delivered to a process for which the table does
not contain a mapping, the message is broadcast.
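
The table maintenance just described can be sketched as a small LRU cache keyed by the logical host identifier. This is an illustrative Python sketch, not the kernel's data structure: the class name `HostTable`, its methods, and the use of strings for host addresses are all assumptions made for the example, and a return value of `None` stands for "no mapping known, broadcast the message".

```python
from collections import OrderedDict
from typing import Optional

LOCAL_ID_BITS = 16  # the logical host identifier is the high-order 16 bits of a pid


class HostTable:
    """Per-kernel view mapping logical host identifiers to host addresses (LRU eviction)."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._capacity = capacity
        self._entries: "OrderedDict[int, str]" = OrderedDict()

    def learn(self, source_pid: int, source_host_address: str) -> None:
        """Record the mapping inferred from an incoming packet."""
        logical_host = source_pid >> LOCAL_ID_BITS
        self._entries[logical_host] = source_host_address
        self._entries.move_to_end(logical_host)   # mark as most recently used
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)     # evict the least recently used entry

    def locate(self, destination_pid: int) -> Optional[str]:
        """Return the host address for a pid, or None if the message must be broadcast."""
        logical_host = destination_pid >> LOCAL_ID_BITS
        address = self._entries.get(logical_host)
        if address is not None:
            self._entries.move_to_end(logical_host)
        return address
```

Because the key is the 16-bit subfield rather than the full process identifier, one entry serves every process that shares a logical host.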

5.4.3. Process Migration

We define transparent process migration as the operation whereby a process moves from one physical host
within the V domain to another, in such a fashion that processes that communicate with it do not have to be
aware of its change in location. A full discussion of process migration within the framework of the V-System
is beyond the scope of this thesis. The following discussion contains some observations relating solely to the
subject of process location^20.

If all V messages were transmitted by broadcast, as suggested in Section 5.4.1, then process migration would
not cause extra problems with respect to process location. All kernels receive all messages and a message
directed to a particular process is delivered without problem even if that process has just moved from one host
to another. However, in order to reduce the overhead associated with interprocess communication, we wish
to deliver most messages by point-to-point communication. Kernels must now be able to associate a
particular host address with a process identifier. If processes are allowed to migrate, the database containing
these associations must somehow be able to reflect the changes. Two solutions seem plausible. The first
solution consists of updating the database. This is severely complicated by the fact that the database is
distributed, and more specifically (partially) replicated over an unknown number of kernels. Broadcasting the
update some number of times would provide good (statistical) certainty that all tables have been updated.
The alternative solution requires changing the semantics of the entries in the database and treating them as
hints. In that case, no attempt is made to update the database when a process moves. However, rather than
assuming that a process does not exist if it is not found at the host address corresponding to its table entry, the
kernel in that case falls back on broadcasting the message (or falls back on transmitting a message to another
location if it receives some form of redirect message).

^19 Incoming packets contain both the source process identifier and the source host address.

^20 A full discussion of process migration in the V-System will be the subject of Marvin Theimer's forthcoming Ph.D. thesis.

Both methods have their costs and benefits. Broadcasting the update $\alpha$ times results in a cost of

$$C_{move} = \alpha \times (K + 1)$$

packet events per process that migrates. Treating the mappings in the table as hints causes the first message
from a particular kernel to that process after its move to cost

$$\beta + (K + 1)$$

in packet events (assuming $\beta$ retransmissions and no redirect messages). More importantly, it causes that first
message exchange to last for the normal message exchange time plus the timeout interval for message
exchanges.
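
The two migration costs can be compared numerically. The Python sketch below simply evaluates the two expressions given above; the function names and the example figures are hypothetical and chosen only for illustration.

```python
def update_broadcast_cost(num_kernels: int, repetitions: int) -> int:
    """Packet events spent broadcasting a table update `repetitions` times after a move."""
    return repetitions * (num_kernels + 1)


def stale_hint_cost(num_kernels: int, retransmissions: int) -> int:
    """Packet events for the first message sent through a stale hint.

    Assumes, as in the text, that the sender retransmits `retransmissions`
    times to the old host and then falls back on one broadcast, with no
    redirect messages.
    """
    return retransmissions + (num_kernels + 1)


if __name__ == "__main__":
    # Hypothetical domain of 20 kernels, update broadcast 3 times, 2 retransmissions.
    print(update_broadcast_cost(20, 3), stale_hint_cost(20, 2))
```

The update cost is paid once per migration regardless of who later talks to the process, whereas the hint cost is paid by each kernel that still holds the old mapping and actually sends to the moved process.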

Also, establishing invalidity of a process identifier becomes more expensive when hints are used. When the
mapping is known to be correct, it suffices to deduce the host address using the table and to check the validity
of the identifier at that host address. This results in a fast (in)validity check. If the table contains only a hint,
invalidity is harder to establish. Indeed, one can no longer conclude from the non-existence of the
process at the hint host address that the process does not exist at all. In the best case, the kernel on that
machine is able to provide some form of redirect message; in the worst case, we have to fall back on the
broadcast mechanism.

The fact that processes are allowed to migrate away from the host on which they were created has a few
other minor repercussions. First, if this were not the case, all processes on a particular machine would
have the same logical host identifier subfield in their process identifier. This is now no longer true, resulting
in a slightly more complicated check for locality. This is important since this check is performed on every
interprocess communication operation. Second, it is no longer true that all process identifiers of processes on
a particular host are allocated by a single kernel. If this were the case, then some subfield of the locally unique
identifier could be kept unique at any particular time, resulting in a trivial hashing function into the process
descriptor table. Finally, if some form of host encoding is done, and the encoding has strong semantics, then
all processes with the same host encoding must, by definition, be present on the same machine and
consequently all move at the same time.


5.4.4.  Conclusions 

Given adequate hardware support in the network interfaces, process location by broadcasting would be the
method of choice in a broadcast domain. Given the limitations of current hardware, we need to resort to
point-to-point communication to reduce the overhead of network interprocess communication. We have
shown a simple procedure whereby each kernel maintains an incomplete view of the mapping of process
identifiers to host addresses and resorts to broadcasting only when the mapping for a particular process
identifier is not locally available. In order to improve the hit ratio as well as to decrease the size of the view
each kernel must maintain, we use host encoding in the process identifiers. We have documented the
repercussions of point-to-point communication with respect to process migration. Two solutions were
suggested: either updating the views of the different kernels or treating the entries in the views as hints.


5.5.  Packet  Layout 

The interkernel packet consists of two main portions: a fixed size header portion containing protocol related
information and the message itself, and a variable size data portion, meant to carry the data in a MoveTo or
MoveFrom, or the data appended to a Send or a Reply. The size of the header is 68 bytes^21. The size of the
data portion may range from zero bytes to the maximum allowed by the given network.

The header and message portion are fixed length. Not all the fields are used for every packet. However,
the fixed size allows for efficient handling of network packets through DMA network interfaces (see Section
5.7.3). The ease with which a DMA transfer can thus be set up offsets the disadvantage of having to put some
number of extra bytes on the network for every packet. For instance, for a page write of 1024 bytes, the fixed
header size requires 28 extra bytes to be sent, resulting in an extra overhead of 60 microseconds (on the 10 Mb
network using a 3-Com interface) on a total of 7.5 milliseconds. For a MoveTo, there are 36 extra bytes in
every packet. For a 64 kilobyte MoveTo, this results in an extra overhead of approximately 5.2 milliseconds
on a total of 200 milliseconds.

The detailed packet format is shown in Figure 5-2. Most fields have the same meaning for all packet types;
they are listed below. A number of fields have specific meanings depending on the packet type; they are
discussed as the individual message sequences are examined (see Section 5.6).

packetType  Indicates the type of operation being performed. Possible values are logicalHostRequest,
logicalHostComplaint, remoteSend, remoteReply, remoteForward, nAck, replyPending,
remoteGetPid, remoteGetPidReply, remoteMoveToRequest, remoteMoveToReply,
remoteMoveFromRequest and remoteMoveFromReply.

sequenceNumber  Sequence number of the message exchange. The sequence numbers need only be unique
relative to each sending process, although in practice they are unique relative to each network
node.

sourcePid  Process identifier of the process that executed the corresponding operation.

destinationPid  Process identifier of the destination process; zero in the case of remoteGetPid.

forwarderPid  Process identifier of the process that forwarded this packet.

userNumber  User number of the source process. This is currently unused. It is intended to be used for
security and authentication purposes.

length  Length in bytes of the data segment appended to this packet.

totalLength  Total length of the data segment of which this packet is part.

localaddress  Originating address of the segment data.

remoteaddress  Destination address of the segment data.
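
For orientation, the fields listed above can be gathered into a single record. The Python sketch below is only a naming aid: the field widths and the wire encoding are defined by Figure 5-2, which is not reproduced here, so the types, the default values and the enumeration's numeric values are assumptions and do not describe the actual 68-byte layout.

```python
from dataclasses import dataclass
from enum import Enum, auto


class PacketType(Enum):
    """Packet types named in the text; the numeric values are NOT the wire encoding."""
    LOGICAL_HOST_REQUEST = auto()
    LOGICAL_HOST_COMPLAINT = auto()
    REMOTE_SEND = auto()
    REMOTE_REPLY = auto()
    REMOTE_FORWARD = auto()
    NACK = auto()
    REPLY_PENDING = auto()
    REMOTE_GET_PID = auto()
    REMOTE_GET_PID_REPLY = auto()
    REMOTE_MOVE_TO_REQUEST = auto()
    REMOTE_MOVE_TO_REPLY = auto()
    REMOTE_MOVE_FROM_REQUEST = auto()
    REMOTE_MOVE_FROM_REPLY = auto()


@dataclass
class PacketHeader:
    """Fixed-size header fields of an interkernel packet (widths assumed, not specified)."""
    packet_type: PacketType
    sequence_number: int
    source_pid: int
    destination_pid: int = 0   # zero in the case of remoteGetPid
    forwarder_pid: int = 0
    user_number: int = 0       # currently unused according to the text
    length: int = 0            # bytes of data segment appended to this packet
    total_length: int = 0      # total length of the segment this packet is part of
    local_address: int = 0     # originating address of the segment data
    remote_address: int = 0    # destination address of the segment data
    message: bytes = b""       # the message itself is carried in the header portion
```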

The V protocol packet is to be encapsulated in the packet format of the underlying protocol. It is assumed
that this underlying protocol provides some form of type field by which it is possible to distinguish V packets.
Current implementations of the V protocol are built on the 3 Mb and the 10 Mb Ethernet data link protocol.
Consequently, V packets are encapsulated in Ethernet data link packets for transmission.


^21 Not counting any network-level or data link-level encapsulation.


5.6. Packet-Level Communication

5.6.1. Remote Message Communication

This section discusses the packet exchanges necessary to perform a remote Send-Receive-Reply sequence.
We first describe the packet sequence under normal conditions. Exceptional conditions are discussed in the
next paragraph.

5.6.1.1 Normal Sequence of Packets

The kernel on the sending machine transmits a remoteSend packet on the network, either directly to the
machine on which the destination process resides or else by means of a broadcast packet. The relevant fields
in the packet (see Figure 5-2) are sequenceNumber, sourcePid, destinationPid, forwarderPid and message.
The length field is zero. Assuming the receiving kernel has a message buffer available, it accepts the incoming
message and queues it for the destination process. This does not cause any acknowledgement packet to be
sent at this time. When the receiving process replies to the message, a remoteReply packet is sent with
sequenceNumber identical to that of the original remoteSend packet, and furthermore with the values of
sourcePid, destinationPid and message set to the appropriate values. The length field is again zero. Again,
there is no explicit acknowledgement to the remoteReply packet.

5.6.1.2 Exceptional Conditions

In order to deal with lost packets or deceased hosts and processes, the protocol provides a retransmission
strategy. We first discuss the sending kernel's strategy and then the receiver's.

When the original remoteSend packet is fired off, the sending kernel starts two timers associated with this
message: a retransmission timer with interval $T_r$ and a timeout timer with interval $T_t$. If the retransmission
timer expires, a remoteSend packet identical to the original is retransmitted and the retransmission timer is
restarted. If the timeout timer expires, the destination process is deemed dead or unreachable, the message
exchange is terminated, and an error code is returned to the sending process.
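
The sender's two-timer strategy can be sketched as a small state machine driven by an explicit clock. This Python sketch is illustrative only: the class `SendExchange`, its method names, and the polling style are assumptions, and it also models the restart of both timers on a replyPending packet, which the protocol defines for the receiver's responses.

```python
class SendExchange:
    """State of one outstanding remoteSend on the sending kernel (illustrative sketch)."""

    def __init__(self, now: float, retransmission_interval: float,
                 timeout_interval: float) -> None:
        self.retransmission_interval = retransmission_interval
        self.timeout_interval = timeout_interval
        self.transmissions = 1        # the original remoteSend packet
        self.state = "waiting"        # "waiting", "replied" or "timed_out"
        self._restart_timers(now)

    def _restart_timers(self, now: float) -> None:
        self.retransmit_at = now + self.retransmission_interval
        self.timeout_at = now + self.timeout_interval

    def on_reply_pending(self, now: float) -> None:
        """A replyPending packet restarts both the retransmission and the timeout timer."""
        if self.state == "waiting":
            self._restart_timers(now)

    def on_reply(self) -> None:
        """A remoteReply packet completes the exchange."""
        if self.state == "waiting":
            self.state = "replied"

    def on_tick(self, now: float) -> bool:
        """Check the timers; return True if a remoteSend must be retransmitted now."""
        if self.state != "waiting":
            return False
        if now >= self.timeout_at:
            self.state = "timed_out"  # destination deemed dead or unreachable
            return False
        if now >= self.retransmit_at:
            self.transmissions += 1
            self.retransmit_at = now + self.retransmission_interval
            return True
        return False
```

A caller would invoke `on_tick` from its clock handler and resend the identical packet whenever it returns `True`; when the state becomes `"timed_out"`, an error code is returned to the sending process.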

When a remoteSend packet comes in, the kernel compares the (sourcePid, sequenceNumber) pair in the
incoming packet against the list of such pairs of message exchanges currently in progress. If the packet is a
retransmission, the kernel discards the packet and sends back a replyPending packet. It also sends back a
replyPending packet if it is forced to discard a new message because no buffers are available. Such a
replyPending packet causes the retransmission timer and the timeout timer to be restarted. This prevents a
Send from timing out due to temporary unavailability of buffers or due to lengthy processing at the receiver.

Additionally, messages can be marked as being idempotent or non-idempotent requests. For an
idempotent request, all notion of the message exchange is discarded by the receiver's kernel as soon as the
Reply is executed. If the remoteReply packet gets lost or if a retransmission of a remoteSend occurs in parallel
with the transmission of the remoteReply packet, this can cause the message to be delivered twice (or more).
Since the operation is marked as idempotent, this should not cause any harm. If the message is marked
non-idempotent, the kernel keeps around the Reply message for an amount of time at least as big as the
retransmission time $T_r$. If a retransmission of the remoteSend packet comes in, the message is not delivered to
the receiver but the Reply is returned instead. The next message from the same process causes the Reply to be
deleted as well, since this indicates that the message exchange terminated on the sender's end (either by
receiving the Reply or by timing out). This next message then effectively functions as an acknowledgement of
the remoteReply packet.
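
The receiver-side rules above (duplicate suppression, replyPending, and the saved Reply for non-idempotent requests) can be sketched as follows. This Python sketch is a simplification under stated assumptions: the class `ReceiverKernel` and its methods are hypothetical, buffers are modeled as a simple counter, and a saved Reply is kept until the next message from the same process arrives, so the time-based expiry after at least $T_r$ is not modeled.

```python
from typing import Dict, List, Optional, Set, Tuple

Response = Optional[Tuple[str, Optional[str]]]  # (packet type, reply message) or None


class ReceiverKernel:
    """Receiving kernel's handling of remoteSend packets (illustrative sketch)."""

    def __init__(self, buffers: int) -> None:
        self.free_buffers = buffers
        self.in_progress: Set[Tuple[int, int]] = set()        # (sourcePid, sequenceNumber)
        self.saved_replies: Dict[int, Tuple[int, str]] = {}   # sourcePid -> (seq, reply)
        self.delivered: List[Tuple[int, int, str]] = []       # messages queued for processes

    def on_remote_send(self, source_pid: int, seq: int, message: str) -> Response:
        saved = self.saved_replies.get(source_pid)
        if saved is not None:
            if saved[0] == seq:
                # Retransmission of an already answered non-idempotent request:
                # return the saved Reply instead of delivering the message again.
                return ("remoteReply", saved[1])
            # A new message from the same process acknowledges the earlier Reply.
            del self.saved_replies[source_pid]
        if (source_pid, seq) in self.in_progress:
            return ("replyPending", None)   # retransmission of an exchange in progress
        if self.free_buffers == 0:
            return ("replyPending", None)   # new message discarded: no buffer available
        self.free_buffers -= 1
        self.in_progress.add((source_pid, seq))
        self.delivered.append((source_pid, seq, message))
        return None                         # accepted; no acknowledgement packet is sent

    def on_reply(self, source_pid: int, seq: int, reply: str, idempotent: bool) -> Response:
        self.in_progress.discard((source_pid, seq))
        self.free_buffers += 1
        if not idempotent:
            self.saved_replies[source_pid] = (seq, reply)
        return ("remoteReply", reply)
```

For an idempotent request nothing is retained after the Reply, so a late retransmission is simply delivered again as if it were a new message.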

When a segment needs to be appended to the Send, the corresponding remoteSend packet's length field
indicates the size of the data segment appended to the message. The same applies for a remoteReply with a
segment appended. Additionally, for the latter the remoteAddress field indicates the origin of the segment in
the sender's address space where the data is to be put.

Finally, when a process executes a Forward on a message coming from a remote process, the kernel sends
back a remoteForward packet to the kernel of the original sender. A remoteForward packet is identical to a
remoteReply packet except for the packet type. The sender's kernel then transmits a new remoteSend packet,
identical to the original one, except for the destinationPid and forwarderPid fields which are modified
appropriately.

5.6.1.3 Examples

Thus, in a typical scenario, a remoteSend and a remoteReply packet are exchanged for a Send-Receive-Reply
exchange. When the receiver does not Reply within the time interval $T_r$, the remoteSend is retransmitted and
a replyPending is sent back in return at time intervals $T_r$ until the Reply is sent. If the receiver does not exist, a
remoteSend packet is transmitted a number of times at intervals $T_r$ until the timeout interval $T_t$ expires. As a
performance optimization for this case, if a kernel is present on the target machine, this kernel sends back a
nAck packet to quickly terminate the message exchange. Finally, if a packet gets lost, either the remoteSend
or the remoteReply, the remoteSend is retransmitted, causing the message exchange to be reexecuted.

5.6.1.4 Discussion

The protocol described above is an instance of a reliable message transport protocol. The underlying layer is
an unreliable datagram protocol. The gap between these two protocols is typically bridged by a PAR (positive
acknowledgement and retransmission) strategy. The sender transmits a packet until it is acknowledged by the
receiver and so does the receiver for the reply packet, leading to four packets for the overall exchange.
However, due to the request-response nature of the V interkernel protocol, i.e. due to the fact that every
request has a response associated with it, it is possible to reduce the number of packets in the above sequence
to two. This is accomplished by having the reply message function as the acknowledgement to the original
request.

It is also necessary for the sender to periodically check whether the receiver is still alive, causing two packets
to be exchanged at regular intervals $T_r$. The advantages of the request-response nature of the protocol are
thus most significant when the message exchange is short. For instance, if the message exchange completes
within $T_r$, then only two packets are necessary, as opposed to four if each individual packet were
acknowledged. The main drawback of this strategy stems from the fact that the remoteReply is not explicitly
acknowledged. When the remoteReply is lost, the remoteSend gets retransmitted. In the case of an
idempotent request, this causes the whole operation to be reexecuted. For non-idempotent requests, it
requires that state about the request (the reply message and its sequence number) be kept around for a longer
time than would typically be necessary if the remoteReply were explicitly acknowledged.

The choice of $T_r$ is governed by a compromise between speedy detection and correction of communication
failures on one hand and the desire not to expend large amounts of processor time (and to a lesser extent
network bandwidth) in unnecessary retransmissions on the other. Assuming independent failures, the number of
retransmissions forms a geometric distribution with parameter q, whereby q is the probability that a message
exchange does not complete without lost packets, that is, that at least one of its two packets is lost, each
being lost with probability p:

$$q = 1 - (1 - p)^2$$

The expected value and the standard deviation of the message exchange time are then

$$M = T_0 + (T_0 + T_r) \times \frac{q}{1 - q}$$

$$\sigma = (T_0 + T_r) \times \frac{q^{1/2}}{1 - q}$$

where $T_0$ is the message exchange time when there is no failure. For typical values of p, which are
very small on a local network, the retransmission interval $T_r$ has negligible effect on the expected time but
is the dominant term in the standard deviation.
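
The claim about the mean and the standard deviation can be checked numerically. The Python sketch below evaluates the expressions for q, M and the standard deviation given above; the function name is hypothetical, and the example figures in the usage (a loss probability of one in ten thousand and a failure-free exchange time of 3 milliseconds) are assumed values chosen for illustration, not measurements from this thesis.

```python
import math
from typing import Tuple


def exchange_time_stats(p: float, t0: float, tr: float) -> Tuple[float, float, float]:
    """Return (q, mean, standard deviation) of the message exchange time.

    p  -- probability that an individual packet is lost (0 <= p < 1)
    t0 -- message exchange time without failures, in seconds
    tr -- retransmission interval, in seconds
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must lie in [0, 1)")
    q = 1.0 - (1.0 - p) ** 2                      # at least one of the two packets lost
    mean = t0 + (t0 + tr) * q / (1.0 - q)
    std = (t0 + tr) * math.sqrt(q) / (1.0 - q)
    return q, mean, std


if __name__ == "__main__":
    # Assumed example values: p = 1e-4, T_0 = 3 ms, T_r = 2.5 s.
    q, mean, std = exchange_time_stats(1e-4, 0.003, 2.5)
    print(f"q = {q:.6g}, mean = {mean:.6g} s, std = {std:.6g} s")
```

Because the mean grows with q while the standard deviation grows with the square root of q, a small loss probability leaves the mean close to $T_0$ but lets $T_r$ dominate the spread.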

Arguing against a small retransmission interval is the desire to reduce overhead associated with network
interprocess communication. For instance, consider a message containing a request whose processing takes T
seconds. Then, the number of retransmissions of this message is $T / T_r$. Retransmissions of the message
cause the processing of the request to be interrupted for receiving the retransmission and generating the
replyPending packet. In the current implementation, this requires slightly over a millisecond in processing
time. Thus, due to retransmission handling, the actual elapsed time for the request becomes

$$T + 10^{-3} \times \frac{T}{T_r} = T \times \left(1 + \frac{10^{-3}}{T_r}\right)$$

For a retransmission interval of 100 milliseconds, for instance, the elapsed time would be inflated by one percent.
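
The one percent figure follows directly from the expression above, as the following Python sketch shows. The function name is hypothetical, and the handling cost is rounded to exactly one millisecond, whereas the text says "slightly over a millisecond".

```python
RETRANSMISSION_HANDLING_SECONDS = 1e-3  # rounded from "slightly over a millisecond"


def elapsed_with_retransmissions(processing_seconds: float,
                                 retransmission_interval: float) -> float:
    """Elapsed time of a request once retransmission handling is accounted for."""
    if retransmission_interval <= 0:
        raise ValueError("retransmission interval must be positive")
    inflation = RETRANSMISSION_HANDLING_SECONDS / retransmission_interval
    return processing_seconds * (1.0 + inflation)


if __name__ == "__main__":
    # A 10 second request with a 100 ms retransmission interval: about 1% longer.
    print(elapsed_with_retransmissions(10.0, 0.1))
    # The same request with a 2.5 s retransmission interval.
    print(elapsed_with_retransmissions(10.0, 2.5))
```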

Due to the low error rates and the very efficient coding of the primitives, the retransmission interval
has a relatively limited effect on the message passing efficiency. It has only a second order effect on the
elapsed time due to the low error rates, and, due to the very efficient coding of the primitives, it has only a
limited effect on the processor utilization. We are currently using a retransmission interval of 2.5 seconds.
The reason for the long retransmission interval is primarily the desire to communicate with the V simulator
running on a loaded, timeshared VAX/Unix system.

5.6.1.5  Idempotency and At-Least-Once Semantics

An important consideration in the design of a reliable transport protocol is the choice of delivery
characteristics guaranteed by the protocol in the face of communication and machine failures.

Let us first restrict the discussion to communication failures. A relatively simple strategy is to provide
at-least-once semantics. The protocol hereby guarantees that if a reply is returned to the sender, the message
has been delivered at least once. This is simply accomplished by retransmitting the message some number of
times until a reply is received. If no reply is received in time, an indication of no delivery is returned^23.

A more sophisticated strategy is exactly-once. In this case, a reply indicates that the message was delivered
exactly once, with again an indication of failure if no reply is received within some interval. On top of
periodic retransmission, as for at-least-once, exactly-once requires maintaining a history of the
identification numbers (typically a source process identifier and a sequence number) of received messages and
suppressing duplicates by comparing the identification number of incoming messages against the history.
Additionally, it requires that replies and identification numbers be kept around for some amount of time after
the Reply has been executed, until it is certain that no retransmissions will come in. This information must be
kept in order to be able to return the reply when a retransmission comes in.
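
The bookkeeping that exactly-once delivery requires can be sketched as follows. This is a minimal illustration, not the V kernel's data structure: the class and method names are hypothetical, and the policy for deciding when history may be discarded is left to the caller.

```python
class ExactlyOnceReceiver:
    """Hypothetical receiver that suppresses duplicates and replays replies."""

    def __init__(self, handler):
        self.handler = handler      # application-level request handler
        self.replies = {}           # (source_pid, sequence_no) -> saved reply

    def receive(self, source_pid, sequence_no, request):
        key = (source_pid, sequence_no)
        if key in self.replies:
            # Retransmission of a message already executed: return the
            # saved reply instead of executing the request again.
            return self.replies[key]
        reply = self.handler(request)
        self.replies[key] = reply   # the extra copy kept for retransmissions
        return reply

    def forget(self, source_pid, sequence_no):
        """Discard history once no further retransmissions can arrive."""
        self.replies.pop((source_pid, sequence_no), None)
```

The saved reply is the extra copy whose cost is discussed in the text; for a page read it would hold the whole page.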

The V-System supports both at-least-once and exactly-once semantics; the sender indicates in the message
the desired semantics. Cedar RPC supports exactly-once semantics [9]. The cost of providing exactly-once
semantics primarily results from the need for the kernel to make a copy of the Reply in order to keep it
around for potential retransmission. The amount of data associated with a Reply can be quite substantial, for
instance in the case of a page read operation. Not supporting exactly-once communication semantics requires
the applications to implement idempotent operations, i.e. operations whose side-effects and return values are
independent of whether they were invoked once or more. More experience is needed to make a well-founded
judgement about this trade-off. On one hand, most applications desire exactly-once semantics. It is therefore
appealing to implement it once in the communication layer and then provide it as a service to applications.
On the other hand, for certain applications, implementing idempotent operations is far less expensive than
using an exactly-once communication primitive with its need for an extra copy. For instance, a file system
page-level read can be made idempotent without any extra effort: the page number provides a unique
identifier that can be used for detecting duplicates and the data associated with the read is always available
later since it is stored in the file system.

Neither the V-System nor Cedar RPC provides any guarantee of delivery across machine crashes. It is
assumed that clients explicitly rebind to servers and execute some appropriate protocol to recover from
machine failures. In contrast, Liskov's RPC guarantees what is called at-most-once semantics in [48], even
across crashes. When a reply is returned, it is guaranteed that the message has been delivered exactly once.
Otherwise, an indication of failure is returned and it is guaranteed that the message has had no effect. This
makes the remote message invocation effectively a multi-machine transaction, with associated commit and
abort protocols. In particular, to survive machine crashes, message transactions have to be recorded on stable
storage. The cost of doing so is quite clear, at least with today's technology. Given a cost of 2 to 3
milliseconds per message, and a cost on the order of 20 milliseconds to access stable storage (a pair of disks,
for instance), the overall cost of a message transaction increases by an order of magnitude. Development of
fast stable storage, such as battery backed-up core memory, could drastically alter the economics of this
situation.

23. The indication of non-delivery is only correct with high probability. The last transmission of the message could have succeeded and
the reply to that message could have failed or could have been delayed beyond the timeout interval. The retransmission and timeout
strategy should be such that this possibility is exceedingly unlikely.

5.6.1.6  Conclusions

For remote message exchanges, we use a protocol that takes advantage of the low-error, low-latency
characteristics of local networks and of the expectation that the duration of most message exchanges is short
(relative to the retransmission interval). Under these assumptions of low error rate, low latency and short
message exchanges, the actual value of the retransmission interval used has only secondary effects on the
perceived performance of message passing, both in elapsed time and in processor utilization. Also, we have
shown the implications of different reliability characteristics of message delivery. In particular, we have
identified the significant cost of exactly-once semantics and explored the alternative of providing idempotent
operations.

5.6.2.  Remote  Data  Transfer 

5.6.2.1  Error-Free Packet Exchanges

When executing a MoveTo, the kernel breaks up the total segment into a number of maximally-sized packets.
It transmits these in a sequence of remoteMoveToRequest packets without ever waiting for an
acknowledgement. Besides sourcePid, destinationPid and sequenceNumber, the relevant fields in the packet
are length, totalLength, localAddress, and remoteAddress. length contains the length of the data segment
appended to this particular packet, and totalLength indicates the total length of the MoveTo operation.
localAddress indicates the start address of the remote data segment, while remoteAddress indicates the
destination address of the data in this particular packet. When the whole sequence of packets has arrived at
the destination machine, the kernel on the destination machine sends back a remoteMoveToReply.
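
The packetization step can be sketched in Python as below. This is a simplified model under stated assumptions: the class and function names are hypothetical, only the length, totalLength and remoteAddress fields are modelled, and the 1024-byte payload size is an assumption (it is consistent with the 64 packets used later for a 64 kilobyte transfer).

```python
from dataclasses import dataclass


@dataclass
class MoveToRequest:
    length: int           # bytes of data carried by this packet
    total_length: int     # bytes in the whole MoveTo operation
    remote_address: int   # destination address of this packet's data
    data: bytes


def packetize_move_to(segment, remote_start, max_data=1024):
    """Split a segment into maximally-sized packets (max_data is assumed)."""
    packets = []
    for offset in range(0, len(segment), max_data):
        chunk = segment[offset:offset + max_data]
        packets.append(
            MoveToRequest(len(chunk), len(segment), remote_start + offset, chunk)
        )
    return packets


packets = packetize_move_to(bytes(2500), remote_start=0x1000)
print([(p.length, hex(p.remote_address)) for p in packets])
```

Because every packet carries its own destination address and the total length, the receiver can place data directly and can tell when the whole sequence has arrived.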

MoveFrom works in a similar fashion except that the kernel executing the MoveFrom sends a single
remoteMoveFromRequest packet to the destination kernel with totalLength indicating the total length of the
MoveFrom. localAddress contains the address where the process expects the data and remoteAddress
indicates where it is to come from. In response, the destination kernel breaks up the segment into
maximally-sized remoteMoveFromReply packets and transmits them to the source kernel without ever waiting for an
acknowledgement. The relevant fields in the remoteMoveFromReply packets are length, indicating the length of
the data segment appended to this packet, and remoteAddress, indicating where to put the data in this packet in
the address space of the process executing the MoveFrom.

5.6.2.2  Dealing with Errors

In order to deal with lost packets, a number of different strategies can be considered. In the current
implementation, the destination kernel of the MoveTo only sends an acknowledgement when it has received
all packets. If the sender does not receive an acknowledgement within a retransmission interval $T_r$, it
retransmits the whole sequence. Given the low error rates of local area networks, full retransmission on error
introduces only a slight performance degradation. However, full retransmission can cause a MoveTo to fail
repeatedly if back-to-back packets are consistently being dropped by the receiver. Also, full retransmission to
deceased processes causes substantial amounts of processor time and network bandwidth to be wasted. In
order to avoid these problems, we are investigating a more elaborate, selective retransmission strategy. The
receiver waits until it sees the last packet in the sequence. At this point, it returns a bit vector^24 indicating the
packets which it successfully received. The sender then retransmits the missing packets. If no
acknowledgement is received within $T_r$, the last packet in the sequence is retransmitted, thereby eliciting an
acknowledgement from the receiving machine.
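
The bit-vector acknowledgement of the selective strategy can be sketched as follows. The encoding (bit i set when packet i arrived) and both function names are assumptions made for illustration; the text does not specify the layout of the vector.

```python
def ack_bit_vector(received_indices):
    """Receiver side: build an integer whose bit i is set if packet i arrived."""
    bits = 0
    for i in received_indices:
        bits |= 1 << i
    return bits


def packets_to_retransmit(bits, total_packets):
    """Sender side: indices of packets whose bit is clear in the acknowledgement."""
    return [i for i in range(total_packets) if not (bits >> i) & 1]


# Example: packets 2, 4, 5 and 6 of an 8-packet sequence were lost.
ack = ack_bit_vector([0, 1, 3, 7])
print(packets_to_retransmit(ack, 8))
```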

5.6.2.3  Discussion

In this section we analyze the expected time and the standard deviation for executing a MoveTo operation
under different transmission strategies. MoveTo operations are different from other forms of large-scale data
transfer that have been analyzed in that, by definition, the recipient of the data has sufficient buffers allocated
to receive the data. Additionally, receipt of packets belonging to a MoveTo is handled with the highest
priority (at the network interrupt level) and is as a result not slowed down by process scheduling delays.
These two factors, combined with the fact that the speed of the sender and the speed of the receiver are more
or less matched, allow the MoveTo operations to be implemented in a way that takes maximal advantage of
the high speed of local area networks without worrying about time-consuming flow control measures. As
such, the protocols presented here represent techniques for achieving near-optimal performance on a local
network. In this analysis, we again assume that packet transmissions are statistically independent events
which can fail with probability $p_n$. Since some of the derivations in this section are quite lengthy, we state the
results of the analysis up front in the next paragraph. The derivation of these results follows.

We are considering a number of different transmission strategies, including

1.  Stop-and-wait:  acknowledge  every  packet 

2.  Streaming  (a  single  acknowledgement  for  all  packets)  with  full  retransmission  on  error,  with  or  without 
negative  acknowledgement 

3.  Streaming  with  selective  retransmission. 

We show that for error rates typical of local area networks, the expected time for a MoveTo under a given
transmission strategy is almost exclusively dependent on the elapsed time for that strategy when no errors
occur, and relatively independent of the retransmission strategy used. Therefore, any strategy that is
suboptimal in the no-failure case also has suboptimal performance under realistic local network operating
conditions. This is not unexpected given the low error rate; our contribution is that we quantify this intuitive
insight. Second, we show that, again for typical local network error rates, the standard deviation is dependent
on the retransmission strategy, in particular on the amount of data retransmitted and the retransmission
interval. The effects of various modifications to the retransmission strategy are quantified.

5.6.2.4  Expected Time

In this section we analyze two strategies that have significantly different no-failure characteristics, namely
stop-and-wait and streaming with full retransmission on error and no negative acknowledgement. We show
that for typical local network error rates, both strategies have the same expected time as in the no-failure case.
Therefore, evidently, the streaming strategy shows much better results than the stop-and-wait strategy.

The analysis of the stop-and-wait protocol is very similar to the analysis for message exchanges. Denoting
by $T(D)$ ($T(1)$) the time necessary for a D-packet (1-packet) transfer, we obviously have

$$T(D) = D \times T(1)$$

The probability of a 1-packet exchange failing is

$$q = 1 - (1-p_n)^2$$

and the probabilities $q(i+1)$ of the exchange succeeding on the $i$-th retransmission form a geometric
distribution with parameter $q$

$$q(i+1) = q^i \times (1-q)$$

The expected time for a 1-packet transfer to complete is then

$$T_0(1) + (T_0(1) + T_r) \times \frac{q}{1-q}$$

where $T_0(1)$ is the time necessary for a 1-packet exchange without any errors. For $D$ packets we get for the
expected time

$$\mu = D \times \left[ T_0(1) + (T_0(1) + T_r) \times \frac{q}{1-q} \right]$$

Let us next consider the case of full retransmission on error (without a negative acknowledgement). A
D-packet transfer succeeds in this case if all $D$ packets reach the destination machine and the
acknowledgement packet reaches the source machine. Assuming independent transmissions, the probability
of the D-packet transfer failing is then

$$q = 1 - (1-p_n)^{D+1}$$

Given this probability $q$, the probabilities $q(i+1)$ that a MoveTo attempt succeeds on the $i$-th retransmission
form a geometric distribution with parameter $q$

$$q(i+1) = q^i \times (1-q)$$

The expected time $T(D)$ for a D-packet transfer becomes

$$\mu = T_0(D) + (T_0(D) + T_r) \times \frac{q}{1-q}$$

Figure 5-3 compares the two strategies for different values of $T_r$. (The other parameters in the figure are $D
= 64$, $T_0(1) = 5.8$ msec, and $T_0(D) = 198$ msec, the latter two from experimental measurements.)
Additionally, Figure 5-4 shows the effect of breaking up very large data transfers into MoveTos of different sizes:
shown in the figure is a 512 kilobyte transfer, broken up into 8, 64 and 512 kilobyte operations.
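
The comparison can be reproduced numerically with a short Python sketch of the two expected-time formulas. The function names are hypothetical; the 1-packet failure probability is taken as one data packet plus its acknowledgement, which is an assumption, and the retransmission interval and the error rates in the loop are illustrative choices rather than values from the text.

```python
def loss_probability(p_n, packets):
    """Probability that at least one of `packets` independent transmissions fails."""
    return 1.0 - (1.0 - p_n) ** packets


def stop_and_wait_time(d, t0_1, t_r, p_n):
    """Expected time for d packets, each acknowledged individually."""
    q = loss_probability(p_n, 2)        # assumed: data packet plus acknowledgement
    return d * (t0_1 + (t0_1 + t_r) * q / (1.0 - q))


def streaming_time(d, t0_d, t_r, p_n):
    """Expected time for d packets streamed with full retransmission on error."""
    q = loss_probability(p_n, d + 1)    # d data packets plus one acknowledgement
    return t0_d + (t0_d + t_r) * q / (1.0 - q)


if __name__ == "__main__":
    D, T0_1, T0_D = 64, 5.8, 198.0      # msec, parameters quoted for Figure 5-3
    T_R = 58.0                          # msec, illustrative choice
    for p_n in (1e-8, 1e-6, 1e-4, 1e-2):
        sw = stop_and_wait_time(D, T0_1, T_R, p_n)
        st = streaming_time(D, T0_D, T_R, p_n)
        print(f"p_n = {p_n:g}: stop-and-wait {sw:.1f} ms, streaming {st:.1f} ms")
```

At small error rates both values sit on their no-failure times, so the gap between the strategies is set by the no-failure costs alone.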

Clearly, for typical values of $p_n$ (on the order of $10^{-?}$), the expected time for both strategies is nearly
identical to the no-failure expected time and relatively independent of the actual value of $T_r$. The
retransmission interval only affects the location of the knee in the curve, which falls outside the domain of
typical values of $p_n$. In comparing the results for the stop-and-wait protocol and the streaming protocol, the
key observation to make is that $T_0(D)$, the no-failure value for the streaming protocol, is significantly
smaller than $D \times T_0(1)$, the comparable value for the stop-and-wait protocol (see Section 3.4.3).
Consequently, for low error rates, where this term dominates, the streaming strategy performs significantly
better. There remains the question of what the optimal streaming size should be: Figure 5-4 shows that the
improvement in no-failure behavior becomes small as the streaming size is increased above 8 kilobytes.
Smaller sizes are evidently also more robust against higher failure rates^25.

These results also allow us to draw a stronger conclusion: since streaming with the
crudest retransmission strategy (full retransmission on error and no negative acknowledgement) results in a
nearly optimal expected time (for the appropriate range of $p_n$ values), no significant improvements in
expected time can be achieved by more sophisticated retransmission strategies. In the next section, we show
that such strategies can significantly improve the standard deviation.

5.6.2.5  Standard Deviation

Given the results of analyzing the expected times, we consider in the rest of this discussion only strategies
that have optimal no-failure characteristics. We also assume that we operate under such error conditions that
the expected time of the transfer is nearly identical to the no-failure transfer time (i.e. we operate in the
leftmost region of the curves in Figure 5-3). We now analyze the standard deviation of different strategies.

Consider a given transmission strategy and denote by $T_0(D)$ the elapsed time in the no-failure case.


25. In Chapter 4 we have already indicated a number of reasons why using larger interaction sizes between a file server and its clients
does not result in significant gain once the interaction size is increased beyond 8 kilobytes.




Figure 5-3: Expected Time for 64 Kilobyte Transfers (expected time plotted against $p_n$ from $10^{-8}$ to $10^{-2}$, with stop-and-wait and streaming curves for several retransmission intervals)

Furthermore, let $T^{(k)}(D)$ be the elapsed time for the $k$-th retransmission, let $T_r^{(k)}$ be the
interval between the $k$-th and the $(k+1)$-th retransmission, and finally let $q(i+1)$ be the probability of success
on the $i$-th retransmission. Then, if the transfer succeeds on the $i$-th retransmission, the total elapsed time for
this transfer is

$$T_0(D) + \sum_{k=1}^{i} \left( T_r^{(k-1)} + T^{(k)}(D) \right)$$

Assuming we are operating under low error conditions and that thus the expected time is constant and
approximately equal to $T_0(D)$, we get for the variance

$$\sigma^2 \approx \sum_{i=1}^{\infty} \left( \sum_{k=1}^{i} \left( T_r^{(k-1)} + T^{(k)}(D) \right) \right)^2 \times q(i+1)$$

This formula indicates three potentially fruitful avenues for reducing the variance:

1. Reduce the retransmission intervals: this can be accomplished either by choosing a small
timeout value or by providing a negative acknowledgement when the transfer fails.

2. Reduce the transmission time for retransmissions: this can be done by reducing the number
of packets to be sent on retransmission. The negative acknowledgement can carry information as to
which packets were successfully received.

3. Reduce the probability of failure of the retransmissions: since we are assuming independent failures,
this probability is only dependent on the number of packets transmitted. Thus, here also reducing the
number of packets sent has a beneficial effect.

Figure 5-4: Expected Time for 512 Kilobyte Transfer

Clearly, a combination of these different approaches seems optimal and to some extent straightforward.
However, we analyze the different methods in isolation to assess their relative benefits. In particular, we
consider the following retransmission strategies, starting from the most straightforward:

1. Full retransmission on error without negative acknowledgement (with different retransmission
intervals).

2. Full retransmission on error with a negative acknowledgement after the last packet.

3. Selective retransmission.

In the case of full retransmission on error without negative acknowledgement, the probabilities $q(i+1)$ form
a geometric distribution with parameter

$$q = 1 - (1-p_n)^{D+1}$$

Furthermore, for all $k$, the $k$-th failed transmission takes $T_0(D) + T_r$.

Figure 5-5: Standard Deviation for a 64 Kilobyte MoveTo

Essentially, in order to achieve a low variance when using full retransmission, we need to choose $T_r$ small
compared to $T_0(D)$. This can be done as shown above by choosing a (physically) small value of $T_r$.
Alternatively, an effectively small value of $T_r$ can be achieved by using a negative acknowledgement, while
still maintaining a much larger physical value of $T_r$. We use the following strategy:

1. If the recipient sees the last packet, it sends either a positive or a negative acknowledgement depending
on whether or not it received all packets in the sequence.

2. If the sender gets a negative acknowledgement, or if the sender does not receive any acknowledgement
within a time interval $T_r$, it retransmits the whole sequence of packets.

The characteristics of this strategy can be derived by the following argument. In the absence of a negative
acknowledgement, every failed transmission always takes $T_0(D) + T_r$. In the presence of a negative
acknowledgement, the length of a failed transmission varies depending on whether such a negative
acknowledgement was sent and received. Denote by $T_{f,k}(D)$ the length of the $k$-th failed transmission, and by
$q(i+1)$ the probability of success on the $i$-th retransmission; then the expected time becomes

$$\mu = T_0(D) + \sum_{i=1}^{\infty} \left( \left( \sum_{k=1}^{i} T_{f,k}(D) \right) \times q(i+1) \right)$$

We approximate the inner sum by

$$\sum_{k=1}^{i} T_{f,k}(D) \approx i \times T_f(D)$$

where $T_f(D)$ is the expected time of a transmission^26. Then, the expected time for the MoveTo becomes

$$\mu \approx T_0(D) + T_f(D) \times \sum_{i=1}^{\infty} \left( i \times q(i+1) \right)$$

Noting again that

$$q(i+1) = q^i \times (1-q)$$

with

$$q = 1 - (1-p_n)^{D+1}$$

we get finally for the expected time

$$\mu \approx T_0(D) + T_f(D) \times \frac{q}{1-q}$$

For the values of $p_n$ that are of interest, the expected time is approximately equal to $T_0(D)$. Given this, the
variance is approximately

$$\sum_{i=1}^{\infty} \left( \left( \sum_{k=1}^{i} T_{f,k}(D) \right)^2 \times q(i+1) \right)$$

Using the same approximation for the inner sum, and substituting for $q(i+1)$, we get

$$\sigma \approx T_f(D) \times \frac{q^{1/2} \times (1+q)^{1/2}}{1-q}$$

We now derive the value of $T_f(D)$. A failed transmission either takes $T_0(D)$ or $T_0(D) + T_r$. The first case
occurs under the following conditions: not all of the $(D-1)$ first packets arrived at their destination, the last
packet arrived at its destination, and the negative acknowledgement arrived at its destination. The
(conditional) probability of this happening, assuming the overall transmission failed, is the product of the
individual probabilities of the above events, divided by the probability of a failure

$$\frac{\left( 1 - (1-p_n)^{D-1} \right) \times (1-p_n)^2}{1 - (1-p_n)^{D+1}}$$

For values of $p_n \ll 1 \div (1+D)$, this is approximately

$$(D-1) \div (D+1)$$

The failed transmission takes $T_0(D) + T_r$ in the case that either the last packet or the negative
acknowledgement gets lost. The probability of this happening, assuming there is a failure, is

$$\frac{1 - (1-p_n)^2}{1 - (1-p_n)^{D+1}}$$

which, again for $p_n \ll 1 \div (1+D)$, reduces to

$$2 \div (D+1)$$

The expected time for a failed transmission is then approximately equal to

$$T_f(D) = \frac{D-1}{D+1} \times T_0(D) + \frac{2}{D+1} \times \left( T_0(D) + T_r \right)$$

If $D \gg 1$, then clearly

$$T_f(D) \approx T_0(D)$$

and finally we get for the standard deviation

$$\sigma \approx T_0(D) \times \frac{q^{1/2} \times (1+q)^{1/2}}{1-q}$$

26. This approximation becomes exact with probability 1 as $i$ tends to infinity. For finite values of $i$, we believe it is a good
approximation, since the values of $T_{f,k}(D)$ are densely clustered around their expected time $T_f(D)$.

This formula indicates that the standard deviation when using full retransmission with a negative
acknowledgement is all but independent of the retransmission interval. The values of $\sigma$ for different values of
$p_n$ are set out in Figure 5-5 for comparison with full retransmission without negative acknowledgement.

By either choosing a small retransmission interval or by using negative acknowledgements, we have
minimized the component of the standard deviation that is dependent on the retransmission interval. The
standard deviation is also dependent on the amount of data retransmitted during retransmissions. This
component can be minimized using selective retransmission.
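
The effect of the negative acknowledgement on the standard deviation can be checked numerically with the sketch below. It applies the approximation $\sigma \approx T_f \times q^{1/2} (1+q)^{1/2} / (1-q)$ to both cases, taking $T_f = T_0(D) + T_r$ when no negative acknowledgement is used (every failed attempt costs that much). The function names, the error rate and the retransmission intervals are assumptions chosen for illustration.

```python
import math


def failure_probability(p_n, d):
    """Probability that a d-packet transfer (d packets plus acknowledgement) fails."""
    return 1.0 - (1.0 - p_n) ** (d + 1)


def failed_attempt_time_with_nack(d, t0_d, t_r):
    """Approximate expected length of a failed attempt with negative acknowledgements."""
    return (d - 1) / (d + 1) * t0_d + 2 / (d + 1) * (t0_d + t_r)


def std_deviation(t_fail, q):
    """sigma ~= t_fail * sqrt(q) * sqrt(1 + q) / (1 - q)."""
    return t_fail * math.sqrt(q) * math.sqrt(1.0 + q) / (1.0 - q)


if __name__ == "__main__":
    D, T0_D = 64, 198.0                 # msec, parameters quoted for Figure 5-3
    P_N = 1e-6                          # illustrative error rate, not from the text
    q = failure_probability(P_N, D)
    for t_r in (58.0, 580.0, 2500.0):   # msec, illustrative intervals
        without_nack = std_deviation(T0_D + t_r, q)
        with_nack = std_deviation(failed_attempt_time_with_nack(D, T0_D, t_r), q)
        print(f"T_r = {t_r:6.0f} ms: sigma without NACK {without_nack:.2f} ms, "
              f"with NACK {with_nack:.2f} ms")
```

In this model the value without a negative acknowledgement grows in step with the retransmission interval, while the value with one changes only through the small $2 T_r / (D+1)$ term.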

Selective retransmission is implemented as follows. In order to execute a MoveTo containing $D$ packets,
$(D-1)$ packets are transmitted without acknowledgement. The last packet is sent reliably: it is retransmitted
periodically until an acknowledgement is received^27. The acknowledgement to the last packet indicates which
of the $D-1$ unreliably transmitted packets got to their destination. If $D'$ did not get there, they need to be
retransmitted using the same method: transmit $D'-1$ packets unreliably and the last packet reliably. This
procedure continues until all packets get to their destination. The above observations allow us to derive the
following recurrence relation for $T(D)$. We denote by $T_{unrel}(i)$ the time necessary to transmit $i$ packets
unreliably. The probability that the full transmission succeeds on the first try is equal to

$$(1-p_n)^{D-1}$$

The probability that $i$ packets get lost on the first try is

$$C(D-1, i) \times p_n^i \times (1-p_n)^{D-1-i}$$

where $C(D-1, i)$ is the number of combinations of $D-1$ by $i$. The time $T(D)$ to transmit $D$ packets with selective
retransmission is then

$$T(D) = T_{unrel}(D-1) + T_{rel}(1) + \sum_{i=1}^{D-1} \left[ T(i) \times C(D-1, i) \times p_n^i \times (1-p_n)^{D-1-i} \right]$$

whereby furthermore

$$T(1) = T_{rel}(1) = T_0(1) + T_r \times \frac{q}{1-q}$$

$$q = 1 - (1-p_n)^2$$

and

$$T_{unrel}(i) = i \times T_{unrel}(1)$$

The standard deviation associated with this retransmission strategy is difficult to derive analytically.
Therefore we have simulated the procedure by computer and determined both the expected time and the
variance from the simulation. Figure 5-5 shows the standard deviation observed in the simulation. The figure
clearly indicates that the behavior of this retransmission strategy with respect to the standard deviation is
superior to strategies that require full retransmission on error.
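
A simulation of this kind can be sketched in a few lines of Python. This is not the simulator used for Figure 5-5, whose details are not given; it is a minimal Monte Carlo model under stated assumptions: packet losses are independent with probability p, the reliable exchange of the last packet fails with probability $1 - (1-p)^2$ and each failure costs one retransmission interval, and the per-packet unreliable transmission time of 3 ms is a hypothetical value.

```python
import random
import statistics


def selective_transfer_time(d, p, t_unrel, t0_1, t_r, rng):
    """Simulate one d-packet MoveTo with selective retransmission (times in ms)."""
    q = 1.0 - (1.0 - p) ** 2                  # assumed failure rate of the reliable exchange
    total = 0.0
    pending = d
    while pending > 0:
        total += (pending - 1) * t_unrel      # all but the last packet go out unreliably
        lost = sum(1 for _ in range(pending - 1) if rng.random() < p)
        total += t0_1                         # reliable exchange of the last packet
        while rng.random() < q:
            total += t_r                      # each failed attempt waits one interval
        pending = lost                        # only the lost packets are sent again
    return total


if __name__ == "__main__":
    rng = random.Random(1)                    # fixed seed for repeatable runs
    samples = [selective_transfer_time(64, 1e-3, 3.0, 5.8, 58.0, rng)
               for _ in range(20000)]
    print(f"mean = {statistics.mean(samples):.2f} ms, "
          f"std = {statistics.pstdev(samples):.2f} ms")
```

With p set to zero the model reduces to the no-failure time, which gives a quick sanity check on the bookkeeping.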

5.6.2.6  Conclusions

Given sufficiently low error rates, the expected time for large data transfers is almost solely dependent on
$T_0$, the transfer time when there are no errors. Therefore, the retransmission strategy is of little or no
influence on the expected transfer time. Assuming independent network failures, we have quantified the
assumption of "sufficiently low error rates". We expect most local networks to perform according to this
assumption. The advantages of streaming, i.e. transferring the data without per-packet acknowledgements,
under these circumstances are quite clear.

27. Although it is not shown here, this strategy performs identically to the previous one, if full rather than selective retransmission is
done.

When  considering  higher  error  rates,  or  more  importantly,  when  considering  the  standard  deviation  of  the 
transfer  time  at  low  error  rates,  the  effects  of  different  retransmission  strategies  start  showing.  In  particular, 
we  have  shown  that  the  standard  deviation  at  low  error  rates  increases  with  the  amount  of  data  retransmitted, 
the  retransmission  interval  and  the  probability  of  further  failures  during  retransmission  (which  is  a  function  of 
the  amount  of  data  retransmitted,  at  least  when  assuming  independent  failures).  A  short  effective 
retransmission  interval  can  be  achieved  by  returning  a  negative  acknowledgement.  Given  the  presence  of 
such  a  negative  acknowledgement,  the  amount  of  data  can  then  be  reduced  by  having  the  negative 
acknowledgement  carry  information  about  the  packets  received  and  by  using  selective  retransmission  on  the 
part  of  the  sender. 

5.6.3.  Remote  Binding 

A client can locate a server (i.e. find out its process identifier) by means of a GetPid, assuming the server
has previously done a SetPid. A very brief discussion of the mechanisms supporting GetPid and SetPid is
included in this chapter for the sake of completeness. An alternative implementation, relying on the
application-level use of multicast, is described in Section 6.6.2. This implementation does not require any
kernel or protocol support whatsoever and results in a far more flexible binding mechanism.

In order to support the SetPid-GetPid mechanism, each kernel maintains a table, called the LogicalIdMap.
This table has a configurable number of entries, each of which has two fields, namely a process identifier and
a scope. In response to a SetPid(logicalId, pid, scope), the kernel fills in the fields of the table
entry with index logicalId with the given pid and scope. Any previous entry is overwritten.

GetPids are handled in the following way. First, when the scope of the GetPid is local, the kernel simply
looks up the table entry with index logicalId and returns the corresponding process identifier, assuming the
scope of the entry is not remote. When the scope is any, the local table is searched first, and when no
successful match is found, or when the process identifier matching the index is found to correspond to a no
longer existing process, the remote GetPid routine is invoked. The latter routine is invoked immediately if the
scope of the request is remote.

The remote GetPid routine broadcasts a remoteGetPid packet on the network asking all kernels if they have the
desired mapping^28. When such a remoteGetPid packet arrives, each kernel looks in its LogicalIdMap and
checks whether it has a process registered for the requested logical identifier (with scope either any or remote).
If so, and if the process identifier registered corresponds to a valid process, it puts a remoteGetPidReply
packet on the network. This packet is sent point-to-point to the requesting kernel. Many such packets may
arrive at the requesting kernel, since many machines may have a process registered for that logical identifier.
The requesting kernel simply takes the first to arrive.
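
The lookup rules can be summarized in a small Python model. This is a sketch under stated assumptions, not the kernel's code: the class, its attributes and the scope constants are hypothetical, the broadcast is modelled as a loop over a shared list of kernels, and the same "entry is not remote" check is applied to local lookups for both the local and the any scope.

```python
LOCAL, REMOTE, ANY = "local", "remote", "any"


class Kernel:
    """Hypothetical model of one kernel's LogicalIdMap and GetPid handling."""

    def __init__(self, network):
        self.logical_id_map = {}    # logical_id -> (pid, scope)
        self.live_pids = set()      # stand-in for the kernel's process table
        self.network = network      # list of all kernels, models the broadcast medium
        network.append(self)

    def set_pid(self, logical_id, pid, scope):
        self.logical_id_map[logical_id] = (pid, scope)  # overwrites any previous entry
        self.live_pids.add(pid)

    def _lookup(self, logical_id, excluded_scope):
        entry = self.logical_id_map.get(logical_id)
        if entry and entry[1] != excluded_scope and entry[0] in self.live_pids:
            return entry[0]
        return None

    def get_pid(self, logical_id, scope):
        if scope in (LOCAL, ANY):
            pid = self._lookup(logical_id, excluded_scope=REMOTE)
            if pid is not None or scope == LOCAL:
                return pid
        # Remote GetPid: ask every other kernel and take the first reply.
        for kernel in self.network:
            if kernel is not self:
                pid = kernel._lookup(logical_id, excluded_scope=LOCAL)
                if pid is not None:
                    return pid
        return None
```

A kernel answers a remote request only for entries registered with scope any or remote, which is why the remote lookup excludes local-scope entries.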

This concludes the specification of the V interkernel protocol and the discussion of its characteristics. We
now turn our attention to some aspects of the implementation of this protocol as part of the distributed V
kernel.


5.7.  Some  Aspects  of  the  V  Kernel  Implementation 

The V kernel has three main modules: the interprocess communication module, the device module, and the
process and memory management module. The way these are structured has some interesting repercussions
on the way remote kernel operations can be performed. This is discussed in Section 5.7.1. We next turn our
attention to the interprocess communication module and in particular to the way in which it accommodates
remote interprocess communication. We single out two topics for discussion. First, network interprocess
communication in the V kernel is handled within the kernel (as opposed to through a process-level network
server). We discuss the motivation for this choice and its ramifications. Second, we show how by proper
design different kinds of network interfaces can be accessed through a common procedural interface, while
still taking advantage of the capabilities of each interface.

28. The logical identifier is contained in the forwarderPid field.


5.7.1. Overall Kernel Structure

The kernel consists of three main modules: the interprocess communication module, the device server
pseudo-process, and the kernel server pseudo-process. The latter two are called pseudo-processes because,
although to other processes they look like regular processes (and the normal interprocess communication
primitives are used to communicate with them), in reality they are a set of procedures within the kernel. The
kernel server is a pseudo-process that provides a message interface to the kernel's process and memory
management operations. The device server provides message-based access to devices. In practice, clients
communicate with the device server more conveniently through the V I/O protocol [13]. The kernel
commonly supports the keyboard, the console, the frame buffer, the mouse, the Ethernet and the clock. The
kernel can also be configured to support one or more disks, for the file server or for workstations with local
disks.

Both from an efficiency and from an integrity standpoint, it is attractive to perform process, memory and
device management in the kernel. For instance, some of the kernel server operations manipulate the kernel's
process descriptor data structures. From an integrity standpoint, it is desirable that these data structures are
only accessible from inside the kernel. Similarly, having the kernel manage devices allows some measure of
protection since the actual device operations are now executed by the kernel and not by a process whose
authorization might be difficult to check. Second, access to devices usually requires execution of privileged
instructions or access to protected memory locations, so part of the device operation has to be done in
privileged (kernel) mode anyway. Doing device management outside the kernel, while saving on kernel
space and complexity, therefore usually requires two extra context switches and additional copying of data.
This can significantly degrade overall performance [81]. Note that, unlike in Unix 4.2 BSD [52] for instance,
the file system and the internet protocol software still reside outside of the kernel. Only the device drivers for
the disk and the network interface are present in the kernel. The resulting increase in kernel size is relatively
small and well justified given its performance benefits.

Given the fact that we want to implement these functions in the kernel, we observe a number of significant
advantages to doing so by means of pseudo-processes which are accessed by the regular message
primitives [63]. First, it provides a uniform model of access to clients, allowing them to be oblivious to the
fact that these operations are actually implemented in the kernel. Second, there is no additional protocol
complexity for implementing these kernel operations remotely. For instance, destroying a remote process
(subject to certain privileges) is done by sending a message to the remote kernel server. The V protocol can be
used to carry this message without any need for additional protocol primitives (packet types). Finally, this
design has a number of potential advantages that have at this point not fully been explored: it reduces the
number of kernel primitives, making it possible to have a single machine "trap" per primitive; it effectively
separates the implementation of interprocess communication from other kernel services, allowing one to
contemplate hardware or microcode support for the interprocess communication module; and finally, it
allows for interposition of a debugger or monitor process.

Although part of the general interprocess communication, device access is not transparent. It is quite
possible to remedy this situation, but this has not been done so far for apparent lack of demand and because
of the costs involved. These costs result primarily from the need to buffer large amounts of data in the kernel.
This would become necessary if data from a device were to be sent to a remote process. Indeed, that data
would have to be kept around for potential retransmissions. Similarly, some of the kernel server operations
are restricted to privileged system processes. For instance, allocating an address space for a new program to
run in is restricted to the so-called team server process. In order to start a program remotely, one therefore
has to go through the team server on the remote machine, which can be instructed whether or not to accept
such remote requests. We have found this restriction to be useful since the kernel server operations
manipulate rather vital machine resources.


5.7.2. The Interprocess Communication Module

When a kernel request is directed at a remote process, or when a packet carrying a remote request comes in,
the network interprocess communication module is invoked. In the V kernel, the entire network interprocess
communication module resides within the kernel. For instance, for a Send addressed to a remote process, we
call a NonLocalSend routine in the kernel, which proceeds to put a packet on the network. When that packet
arrives at the receiver's kernel, the network interrupt handler invokes the appropriate kernel routine to
process the message, i.e. put it in the receiver's message queue and, if necessary, unblock the receiver.

An alternative organization would be to have a process-level network server handle the network aspects of
a remote request. Again taking the example of a Send to a remote process, the kernel now forwards the request
to the network server process, which (most likely via another kernel call) handles transmission of the packet
over the network. On the receiving end, the kernel network device handler passes the packet on to the
network server on its machine, which then forwards it, again via a kernel call, to the final receiver.

5.7.2.1 Performance Issues

The tradeoff between the two approaches is mainly one of performance. Implementing network
interprocess communication outside of the kernel causes extra kernel traps, extra process switches and extra
copy operations. Let us follow in detail the execution of a remote Send-Receive-Reply sequence under both
implementation strategies and under the assumption that no other processes than the sender, the receiver, and
the respective network servers are present. Consider Figure 5-6. In this figure, an arrow represents a transfer
of control in or out of the kernel. For the purposes of interprocess communication each transfer causes a copy
of the message to happen. We count the expense of the different operations as follows: a kernel trap is the
sum of an entry into and an exit from the kernel; a process switch is the loading of the volatile storage
associated with a process. From Figure 5-6 it can be seen that an implementation of remote interprocess
communication in the kernel requires 2 kernel traps, 0 process switches and 4 copies. An implementation
outside of the kernel requires 6 kernel traps, 4 process switches and 12 copies. A kernel trap requires
approximately 60 microseconds, a process switch about 150 microseconds and copying 32 bytes takes
approximately 25 microseconds. This would indicate that an implementation outside of the kernel would
require approximately 1 millisecond more in processor time on the sender and the receiver machine
combined. The difference in elapsed time is more difficult to estimate given the potential for concurrency.
Besides, extra machinery for generating timeouts and retransmissions causes additional process switches, and
thus additional expense. Moreover, as has been noted in many process-level protocol implementations [20],
process scheduling can be an additional source of real-time delay.
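
The 1 millisecond figure follows directly from the counts and unit costs quoted above. The short Python calculation below reproduces it; the function and constant names are ours, and each copy is assumed to move one 32-byte message.

```python
# Unit costs quoted in the text, in microseconds.
TRAP_US = 60      # one kernel trap (entry plus exit)
SWITCH_US = 150   # one process switch
COPY_US = 25      # one copy of a 32-byte message


def processor_time_us(traps: int, switches: int, copies: int) -> int:
    """Processor time for one remote Send-Receive-Reply, both machines combined."""
    return traps * TRAP_US + switches * SWITCH_US + copies * COPY_US


in_kernel = processor_time_us(traps=2, switches=0, copies=4)       # 220
process_level = processor_time_us(traps=6, switches=4, copies=12)  # 1260
extra = process_level - in_kernel                                  # 1040

print(f"kernel-level:  {in_kernel} us")
print(f"process-level: {process_level} us")
print(f"difference:    {extra} us")
```

The difference of 1040 microseconds is dominated by the four process switches (600 microseconds), with the four extra traps and eight extra copies contributing 240 and 200 microseconds respectively.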

5.7.2.2 Synchronization and Interrupt Disable Time

Clearly, then, there are significant performance gains to be obtained from a kernel-level implementation of
the remote interprocess communication module. However, such a concession with respect to modularity is
not without its price. The handling of a remote request is now done at the network interrupt level. Since
handling such a request may involve manipulating message queues, it needs to be synchronized with local
(kernel) operations manipulating the same data structures. Additionally, the timer interrupt can cause various
retransmissions or timeouts to take place, again potentially accessing message queues. We first show how we
maintain synchronization by selectively turning off interrupts. We then provide some estimates with respect
to maximum interrupt disable time.

Although a finer grain of synchronization is conceivable, we have chosen to turn off all network interrupts
during kernel operations (both for local and remote requests). Additionally, with respect to remote
operations, a timer interrupt can cause one of three things to happen:

1. A request for a remote process is retransmitted.

2. A request for a remote process is timed out, and the requester is added to the ready queue with some
appropriate indication of failure.

3. A request from a remote process is timed out (i.e. the kernel has not received any retransmission of the
request for a time longer than the timeout interval Tj), its packet buffer is deallocated and references to
it are removed. (Most likely, such a reference would be in the message queue of a local process.)

Figure 5-6: Kernel-level vs. Process-level Implementation


Turning off the timer interrupt altogether during kernel operation would distort the proper operation of the
kernel's timing services. Nevertheless, we must prevent uncontrolled concurrent access to message queues.
Therefore, we allow the timer interrupt to occur at any time, and we allow the corresponding routine to
perform all of its duties, except when the timer interrupts while the kernel is running. In that case, we prevent
it from accessing the message queues.
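
The rule can be illustrated with a small model. The Python sketch below is not the kernel's interrupt code: the `in_kernel` flag, the field names, and in particular the choice to leave expired requests pending until a later tick are assumptions, since the text does not say what happens to the deferred queue work.

```python
class KernelState:
    """Hypothetical kernel state shared with the timer interrupt routine."""

    def __init__(self):
        self.in_kernel = False               # set on kernel entry, cleared on exit
        self.ticks = 0                       # timing services
        self.message_queue = ["req-A", "req-B"]
        self.expired = set()                 # requests whose timeout has elapsed


def timer_interrupt(k: KernelState) -> bool:
    """Run the timer routine; return True if the message queues were touched."""
    k.ticks += 1                             # timing services always advance
    if k.in_kernel:
        # The interrupted kernel code may be in the middle of a queue
        # operation, so queue work is skipped (assumed: retried next tick).
        return False
    k.message_queue = [m for m in k.message_queue if m not in k.expired]
    k.expired.clear()
    return True
```

The clock never misses a tick under this scheme; only the queue-related duties are delayed, by at most the time the kernel remains active.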

Having thus achieved synchronization by selectively turning off interrupts, we would like to convince
ourselves that the maximum interrupt disable time remains relatively small. This is especially important for
the timer interrupt since prolonged disabling might mask the next timer interrupt. Reviewing the various
events a timer interrupt might cause, the most common operation is the (re)transmission of a single packet.
This causes the timer interrupt to remain disabled while the packet is being copied out from main memory to
the network interface. (The actual network transmission proceeds asynchronously.) For a maximum size
packet, we can estimate this copy to take about 1.5 milliseconds, significantly lower than the timer interval of
10 milliseconds. The situation could degrade significantly if network transmission had to be done
synchronously on a slow network or if the timer interval was chosen much shorter. A very unlikely worst case
can occur as a result of a remote request being timed out. This causes the kernel to search through the
message queue of the intended recipient of that request in order to remove the message buffer from that
queue. While potentially long, the queue length and thus the search time are still bounded by the maximum
number of message buffers the kernel has allocated.

Network interrupts are disabled altogether during kernel operation. Prolonged disabling of the network
interrupts could cause a problem if packets are not removed quickly enough from the device input queue,
causing the queue to overflow and packets to be dropped. While this is not in contradiction with the
definition of the Ethernet data link level, it could lead to undue performance degradation. We have no data
available on the percentage of time or the maximum length of time the network interrupts are disabled. In
practice, it has not been observed to be a problem.

5.7.2.3 Concluding Remarks

Placing  network  interprocess  communication  inside  or  outside  of  the  kernel  is  a  tradeoff  of  performance  vs. 
modularity.  We  have  quantified  the  cost  associated  with  using  a  network  server  process  for  remote 
interprocess  communication.  We  have  also  shown  the  problems  one  has  to  deal  with  as  a  result  of  a  kernel- 
level  implementation,  particularly  with  respect  to  interrupt  disabling.  Our  experience  indicates  that  when 
proper  care  is  taken  in  the  implementation,  and  when  high  performance  interprocess  communication  is 
desired,  the  performance  advantages  of  a  kernel-level  implementation  outweigh  the  disadvantages. 

5.7.3.  Packet  Transmission  and  Reception 

At present, we are using, or anticipate using, a number of network interfaces with quite different
characteristics. We divide those interfaces into four categories according to their capabilities.

1. Programmed I/O interfaces: the processor has to copy the packet from main memory over the I/O bus
to the network interface. The network interface is assumed to generate a single interrupt on receipt or
on successful transmission of a packet.

2. DMA interfaces: the network interface is capable of accessing a contiguous DMA buffer in main
memory.

3. Scatter-gather DMA interfaces: the network interface is capable of composing a packet out of a number
of discontiguous buffers in main memory.

4. Programmable network interfaces: the network interface can be programmed on-board; it uses DMA to
access data in main memory.

In implementing remote interprocess communication, we would like to provide a single, device-independent
procedural interface to these different devices. Nevertheless, we want this procedural interface to allow the
implementer sufficient freedom to take advantage of the capabilities of a specific interface. In particular, we
would like to avoid all unnecessary copying.

5.7.3.1 Packet Transmission

When transmitting a packet, (most of) the header and the message are present in the process descriptor of
the sending process. The data segment, if any, is in general in the process' address space. In order to take
maximum advantage of the capabilities of various network interfaces, we use the following technique. The
process descriptor is laid out in such a way that the part of it that contains the packet header information and
the message is laid out contiguously, in the same format as the packet itself. This requires only a minor
amount of extra space in the process descriptor (20 bytes). Three extra fields are allocated in the process
descriptor for the purpose of remote communication: segmentPointer and segmentSize indicate the beginning
and the size of the segment, if any, that is associated with this request; currentPointer is reserved for use by
the network interface (see below). The segment size is not restricted to a single packet: for MoveTo and
MoveFrom operations, it consists of the entire segment to be transmitted. In order to have the packets
associated with this request transmitted onto the network, a single call is made on the routine
WriteKernelPacket(pd)

where pd is a pointer to the process descriptor of the requesting process. Although the procedural interface to
WriteKernelPacket is independent of any particular network interface, the further execution of the call
depends entirely on the particular interface used.

On a programmed I/O interface, the processor moves the packet header from the process descriptor into
the network interface, followed by the first part of the segment that fits in a packet. If more than one packet
needs to be sent, the process descriptor is put on a queue. When the transmitter interrupt indicates
transmission of the first packet is finished, the next packet is sent in the same way. The WriteKernelPacket
routine uses the currentPointer field in the process descriptor to keep track of where the data segment of the
next packet is to come from. For the purpose of speeding up MoveTo and MoveFrom, it is extremely
attractive to have a network interface that provides double buffering on transmission. Referring back to
Figure 3-8, this would allow us to operate the network at a speed close to its advertised data rate.
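
The multi-packet case on a programmed I/O interface can be sketched as follows. This Python model is illustrative only: the packet data size `MAX_DATA`, the representation of the segment as a byte string, the `done` flag, and the assumption that every packet repeats the header are ours. Only the roles of the segment pointer, segment size and currentPointer come from the description above.

```python
from collections import deque
from dataclasses import dataclass

MAX_DATA = 1024  # assumed number of segment bytes that fit in one packet


@dataclass
class ProcessDescriptor:
    header_and_message: bytes  # laid out contiguously, in packet format
    segment: bytes             # stands in for memory at segmentPointer
    current_pointer: int = 0   # offset of the data for the next packet
    done: bool = False         # hypothetical flag: whole segment sent

    @property
    def segment_size(self) -> int:
        return len(self.segment)


def next_packet(pd: ProcessDescriptor) -> bytes:
    """Build the next packet and advance current_pointer past its data."""
    start = pd.current_pointer
    end = min(start + MAX_DATA, pd.segment_size)
    pd.current_pointer = end
    pd.done = end >= pd.segment_size
    return pd.header_and_message + pd.segment[start:end]


def write_kernel_packet(pd: ProcessDescriptor, wire: list, pending: deque) -> None:
    """Send the first packet; queue the descriptor if more packets remain."""
    wire.append(next_packet(pd))
    if not pd.done:
        pending.append(pd)


def transmitter_interrupt(wire: list, pending: deque) -> None:
    """Called when the previous packet has left the interface."""
    if not pending:
        return
    pd = pending[0]
    wire.append(next_packet(pd))
    if pd.done:
        pending.popleft()
```

For a 2560-byte segment, one call to `write_kernel_packet` followed by two transmitter interrupts puts three packets on the wire, carrying 1024, 1024 and 512 bytes of segment data.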

For a simple DMA interface, the header and the appropriate part of the segment are first copied into a
contiguous location in (kernel) memory before the DMA operation is set up. For remote V interprocess
communication, there are no advantages to using such an interface, since the processor must make a copy of
the packet, as for programmed I/O interfaces. Scatter-gather DMA devices avoid this copy by passing the
DMA device pointers to the header and the appropriate part of the segment. Multi-packet transfers are done
in the same way as for programmed I/O interfaces.

If the network interface can be programmed on-board, we would like the procedure WriteKernelPacket to
execute entirely on the interface. This would not only relieve the processor of packet transmissions but also of
the management of the queue for multi-packet transfers. The processor would then only be interrupted when
the whole packet sequence has been transmitted. Additionally, depending on the sophistication of the board,
one could also offload retransmission and timeout management to the interface.
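
The idea of one procedural interface with device-specific execution can be sketched with an abstract base class. The Python below is a simplified illustration, not the kernel's driver code: the class names are hypothetical, the routine takes the header and segment directly rather than a process descriptor, only single-packet transfers are modelled, and `cpu_bytes_copied` is an instrument added to show which devices make the processor copy the packet.

```python
from abc import ABC, abstractmethod


class NetworkInterface(ABC):
    """Common procedural interface; subclasses decide how the packet is moved."""

    def __init__(self):
        self.cpu_bytes_copied = 0  # bytes the processor itself had to move
        self.wire = []             # packets that reached the network

    @abstractmethod
    def write_kernel_packet(self, header: bytes, segment: bytes) -> None:
        ...


class ProgrammedIO(NetworkInterface):
    def write_kernel_packet(self, header, segment):
        packet = header + segment               # processor moves every byte
        self.cpu_bytes_copied += len(packet)
        self.wire.append(packet)


class SimpleDMA(NetworkInterface):
    def write_kernel_packet(self, header, segment):
        staging = bytearray(header + segment)   # processor builds contiguous buffer
        self.cpu_bytes_copied += len(staging)
        self.wire.append(bytes(staging))        # device then reads it by DMA


class ScatterGatherDMA(NetworkInterface):
    def write_kernel_packet(self, header, segment):
        descriptors = [header, segment]         # only pointers are handed over
        self.wire.append(b"".join(descriptors)) # device gathers the pieces itself
```

All three classes put the same bytes on the wire; only the scatter-gather device leaves `cpu_bytes_copied` at zero, which is the saving the text attributes to it.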

5.7.3.2 Packet Reception

On the receiving end, it is more difficult to provide a uniform interface to the different network devices. In
order to reduce the amount of specialized code associated with remote interprocess communication, remote
messages are queued in alien process descriptors; these are regular process descriptors but they represent
messages from remote processes. The kernel always has an alien process descriptor ready for incoming requests.

On a programmed I/O device, the first few words of the packet are read into the packet buffer, just enough
to read the protocol type. This field allows the kernel to distinguish interkernel packets from regular Ethernet
access. If the packet is an interkernel packet, the rest of the packet header is read into the packet buffer. If a
segment is appended to the packet, it is moved immediately into its final destination.
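
The staged read can be sketched as below. The Python model is purely illustrative: the packet layout (a 2-byte protocol type, a 2-byte destination offset, a 16-byte header in total) and the type value `0x0701` are hypothetical, and a byte stream stands in for the device's receive buffer.

```python
import io
from typing import Optional

INTERKERNEL_TYPE = 0x0701  # hypothetical protocol type value
TYPE_LEN = 2               # hypothetical: bytes needed to read the type
HEADER_LEN = 16            # hypothetical interkernel header length


def receive(device: io.BytesIO, destination: bytearray) -> Optional[bytes]:
    """Read one packet from the device in three stages.

    Returns the header (as it would be stored in an alien process
    descriptor), or None if the packet is not an interkernel packet.
    """
    # Stage 1: read just enough to see the protocol type.
    first = device.read(TYPE_LEN)
    if int.from_bytes(first, "big") != INTERKERNEL_TYPE:
        return None  # left to the regular Ethernet access path

    # Stage 2: read the rest of the header into the packet buffer.
    header = first + device.read(HEADER_LEN - TYPE_LEN)

    # Stage 3: the header says where the segment goes, so the segment is
    # moved straight from the device to its final destination.
    offset = int.from_bytes(header[2:4], "big")
    segment = device.read()
    destination[offset:offset + len(segment)] = segment
    return header
```

Because the processor reads the device incrementally, it can inspect the header before touching the segment, so no intermediate copy of the segment is needed on this class of interface.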

Regardless of whether one uses a simple or a scatter-gather DMA interface, there does not seem to be a
simple way in which to avoid an extra copy if a segment is appended to an incoming packet. For simple DMA
devices, the entire packet has to be read into a contiguous DMA buffer. We let the initial part of this buffer
correspond to the process descriptor for the incoming request. The segment is then copied by the processor to
its final destination. The same holds essentially for scatter-gather DMA interfaces: the extra copy is necessary
because the kernel has to look at the header before it can decide where the segment goes. One could however
avoid the copy by receiving the segment in a buffer aligned on a page boundary. If the corresponding client
buffer is page-aligned as well, the page could then be mapped from the kernel's address space into the client's
without an extra copy.

Finally, if the interface can be programmed on-board, the program executing on the board could look at the
header and DMA the appended segment immediately to its final location. Additionally, such an interface
could provide a number of useful functions, such as discarding broadcast packets not of interest to this kernel,
etc. We believe such an interface could significantly offload the processor of its duties with respect to network
interprocess communication.


5.8. Chapter Summary

In this chapter we have related to the reader some important experience gained from the design and the
implementation of the V interkernel protocol. We have separated the definition of the protocol from its
specific implementation in the V kernel. The definition of the protocol consists of a set of rules governing
process naming, process location, packet format and packet sequencing. In summary, we have shown that

1. Domain-wide unique identifier generation by announcement-complaint algorithms can be made
arbitrarily failure-resistant at the expense of increased running time of the algorithm.

2. Although in a broadcast domain, process location by broadcasting is the method of choice, we have
chosen point-to-point communication to reduce the overhead on unrelated processors.

3. With respect to message exchanges, we have quantified the cost of exactly-once semantics and explored
the potential of idempotent operations.

4. The expected time of large data transfers is primarily dependent on the no-error transfer time, which
must therefore be optimized by streaming. The variance is dependent on the retransmission strategy:
negative acknowledgements and selective retransmission are suggested.

This protocol has been implemented in the distributed V kernel and has been in use as such for almost two
years. Among the implementation aspects, we have singled out two important aspects for discussion:

1. We have shown that the problems of a kernel-level implementation of network interprocess
communication can be avoided if appropriate care is taken in the implementation, and that the
performance benefits in our environment are significant.

2. We have explored the potential of offloading network interprocess communication to intelligent
programmable network interfaces.

Many of the design decisions are predicated on intuitive notions about the underlying hardware and about
the usage of the protocol. In this chapter we have developed a set of expressions relating the performance of
the protocol to characteristics of the hardware and, to a lesser extent, usage characteristics. While we have
some idea of how the hardware behaves, no data of any significance is available on usage patterns in a
distributed system. Whether our decisions are well matched with typical usage patterns therefore remains an
open question.


One-to-Many Interprocess Communication


6.1. Introduction

So far we have discussed the most common form of message-based interprocess communication, namely
one-to-one communication: a single sender sends a message to a single receiver and (usually) gets a response
back. For example, a process reads a file page by sending a message to the file server requesting that
particular page. The file server then returns the page in a reply message. However, a process may need to
communicate with a group of processes to either notify them of some information or to query them for
information. For example, a process needing file service may want to query all server processes to determine
which one is a file server, or which server stores the desired file.

The need for one-to-many communication becomes apparent in distributed systems, where the processing
and storage facilities are divided among several nodes connected by a network. In this environment, in the
absence of global shared memory, one-to-many communication is used to locate services that would otherwise
be advertised in a table stored in global memory. In general, shared memory can be viewed as a broadcast
communication channel (with the additional capability of storage). Information of general interest is posted
in shared memory and then searched to satisfy queries from different processes. Broadcast provides a similar
communication function, but without storage. Information is multicast to a set of possibly interested parties
or, alternatively, a process multicasts to a set of relevant parties to locate the desired information.

Fortunately, most distributed systems are connected by local network and bus technologies that provide
efficient multicast communication, i.e. one-to-many interhost communication. For example, in shared bus
technologies like Ethernet [27], each packet is essentially received by all hosts and only a filtering mechanism
based on destination address gives the appearance of point-to-point communication. By providing a filter
that accepts multiple specified addresses, (unreliable) multicast is available at no extra cost over one-to-one
communication, both to the sender and all of the receivers. Thus, to communicate without guaranteed
delivery with N workstations takes a single packet on a multicast network whereas simulating this using
point-to-point communication takes N packets. Clearly, an application needing one-to-many communication
(which may be unreliable) can be implemented much more efficiently when provided with efficient access to
local network multicast than when forced to use one-to-one communication for the same purpose.
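
The packet counts can be illustrated with a toy model of a shared bus. In the Python sketch below, the `Host` class, the representation of addresses as arbitrary hashable values, and the group name "G" are assumptions made for the example; no real network API is involved.

```python
class Host:
    """Hypothetical host on a shared bus with an address filter."""

    def __init__(self, address, groups=()):
        self.address = address
        self.accept = {address, *groups}  # filter: own address plus group addresses
        self.received = []

    def filter(self, dest, payload) -> None:
        # Every host sees every packet; the filter decides whether to keep it.
        if dest in self.accept:
            self.received.append(payload)


def transmit(hosts, dest, payload) -> int:
    """Put one packet on the bus; return the number of packets sent (always 1)."""
    for host in hosts:
        host.filter(dest, payload)
    return 1


members = [Host(i, groups={"G"}) for i in range(5)]
outsider = Host(99)
bus = members + [outsider]

multicast_packets = transmit(bus, "G", "query")
unicast_packets = sum(transmit(bus, m.address, "query") for m in members)
print(multicast_packets, unicast_packets)
```

Reaching the five group members costs one packet with the multi-address filter and five packets with point-to-point addressing, while the host outside the group receives nothing in either case.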

This chapter describes an extension of the V interprocess communication to provide one-to-many
communication to applications. An earlier version of this chapter appeared in [18]. Section 6.2 describes the
considerations taken into account when integrating multicast. Section 6.3 describes the model of one-to-many
communication we support and the extensions to the kernel interface necessary to provide this facility.
Section 6.4 presents motivations for our design choices. Section 6.5 discusses the implementation.
Experience with applications and ideas for use are presented in Section 6.6. Section 6.7 describes related
work. Conclusions are drawn in Section 6.8, in which we also point out some remaining open questions.


6.2.  One-to-Many  Communication  Considerations 

Several considerations are recognized in extending the V kernel to provide one-to-many communication.
First, one-to-many communication should be provided as an integral, transparent part of one-to-one
interprocess communication. Processes should be able to send, receive and reply to a message addressed to
multiple processes in the same way as if the message were addressed to a single process. This avoids
redundant code in the kernel and awkward programming at the process level. We do not however preclude
the ability for more sophisticated applications to discriminate group messages when necessary.

Second, we are interested in determining whether the V kernel's interprocess communication is smoothly
extensible to one-to-many communication. The large body of existing software that uses the one-to-one
primitives and the favorable experience in using it make it unattractive to make unnecessary changes to it or
to develop an independent design. However, we do not place compatibility with the existing kernel interface
as an unquestionable constraint on our thinking.

Third, one-to-many communication should be efficient in terms of delay, network bandwidth and processor
usage, both relative to the performance provided by hardware multicast and relative to the performance
of one-to-one communication. We are particularly interested in using one-to-many communication to
structure the communication aspect of highly parallel distributed computation on a set of workstations with
no shared memory (see Section 6.6.3). Efficient communication is crucial in achieving this goal.

Finally,  in  keeping  with  the  design  philosophy  of  the  V  kernel,  we  wish  to  have  the  kernel  provide  a  simple 
but  efficient  and  flexible  mechanism  such  that  more  powerful  constructs  can  be  built  at  the  application  level 
(or  run-time  support  level). 

6.3.  One-to-Many  Extensions 

6.3.1.  Communication  Model 

A process group is a set of one or more processes, possibly on different machines, identified by a group
identifier. All processes in a group are equal; there are no distinguished members. Processes can freely join or
leave groups, and are free to join multiple groups.

A group identifier is identical in format to a process identifier. Sending to a group involves specifying a
group identifier to Send instead of a process identifier. Any process can send to a group, including
non-members.

A process may elect to receive zero, one or multiple reply messages in response to a message sent to a group.
The kernel handles each case as follows. The zero-reply case is handled as an unreliable multicast to the
group with the sending process not blocking. The one-reply case blocks the sender to receive one reply
message, assuring reliable delivery to at least one process. Further replies are discarded and no indication is
given as to how many other processes received the message or replied to it. This form of communication is
the same as one-to-one reliable communication to the first respondent and unreliable datagram
communication to the rest of the group.

The multiple-reply case is similar to the one-reply case except that the second and subsequent reply
messages are queued in the kernel for the sender to retrieve, up until the start of the next message transaction,
i.e. the next Send. It is left to the sending process to decide how many replies it wishes to receive and how
long it is willing to wait for them.

The addition of no-reply and multiple-reply Send presents somewhat of a departure from the synchronous
message-passing present in Thoth and carried forward in the one-to-one communication in V. This departure
is necessary in order to accommodate certain useful applications of one-to-many communication. The model
of one-to-many communication is implemented by adding some new primitives to the kernel and by
extending the semantics of some existing primitives.

6.3.2. New Primitives

groupId = AllocateGroupId()

Allocates and returns a group identifier, with the invoking process becoming a member of
the new group. Group identifiers are identical in syntax and similar in semantics to process
identifiers.

JoinGroup( groupId, processId )

Makes the process with process identifier processId a member of the group with group
identifier groupId. In particular, all messages sent to groupId are received by this process
(with some degree of reliability). The kernel makes no effort to ensure that groupId
represents an existing group.

LeaveGroup( groupId, processId )

Removes the process with process identifier processId from the group with group identifier
groupId.

processId = GetReply( replyMessage )

Returns the next reply message from a group Send in replyMessage and the identity of the
replying process in processId. If no reply message is available, GetReply returns with
processId set to 0. Additional reply messages for this particular transaction may be
received and, if any, will be returned on a later invocation of GetReply. It is thus left to the
sender to decide on what basis to determine that it has received enough replies. However,
all replies for a particular message transaction are discarded when the process sends again,
thus starting a new message transaction.

There is a range of (statically allocated) reserved group identifiers, disjoint from those allocated by
AllocateGroupId(). The use of these group identifiers is similar to the well-known sockets in PUP [11] and
other protocol families.

6.3.3. Changes to Existing Primitives

pid = Send( message, groupId )

Sends a message to all processes that are members of the group with group identifier
groupId.

Two flag bits are reserved in the message to indicate whether the process expects no reply,
a single reply or many replies. If no reply is expected, the Send returns immediately with
groupId as return value. The reliability of this Send is equal to that of the underlying data
link of the network. If a reply is expected, the process is blocked until the first reply
arrives. If no reply arrives within a certain time interval, the request is retransmitted and
eventually, after a number of retransmissions, timed out. Multiple replies are received
using GetReply after the first one has been received. However, retransmissions stop after
the first reply is received. Additional reply messages may be queued any time up until the
end of the current message transaction, signaled by the next Send from this process.

processId = Receive( message )

We assume the invoking process is a member of the group to which process processId has
sent. Receive returns processId and the message exactly as if the sending process had sent
directly to the receiver. The receiver is also expected to Reply to processId exactly as in the
one-to-one case (rather than to Reply to the group).
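
The reply handling described for Send and GetReply can be modeled in a few lines. The following Python
sketch is only an illustration of the stated semantics, not the kernel's code: the class name, the method
names and the numeric encoding of the two flag bits are assumptions made for the example.

```python
from collections import deque

# Hypothetical encoding of the two flag bits carried in the message.
NO_REPLY, ONE_REPLY, MANY_REPLIES = 0, 1, 2


class SenderState:
    """Toy model of the per-sender reply bookkeeping (illustrative only)."""

    def __init__(self):
        self.mode = NO_REPLY
        self.got_first = False
        self.queued = deque()  # second and subsequent replies awaiting GetReply

    def send(self, mode):
        # A new Send starts a new message transaction and discards old replies.
        self.mode = mode
        self.got_first = False
        self.queued.clear()

    def reply_arrives(self, replier_pid, reply):
        """Return the reply that unblocks the sender, or None otherwise."""
        if self.mode == NO_REPLY:
            return None                      # sender never blocked; reply dropped
        if not self.got_first:
            self.got_first = True
            return (replier_pid, reply)      # the first reply is returned by Send
        if self.mode == MANY_REPLIES:
            self.queued.append((replier_pid, reply))
        return None                          # one-reply mode discards the rest

    def get_reply(self):
        # Mirrors GetReply: process identifier 0 means no reply is available.
        return self.queued.popleft() if self.queued else (0, None)
```

In this model a one-reply group Send is indistinguishable, from the sender's side, from a one-to-one Send:
only the first reply is ever visible.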


6.3.4.  Example 

Figure 6-1 gives a detailed description of the exact sequence of events on a multiple-reply Send. First, the
process executes a Send to a group identifier and indicates in the message that it expects to receive more than
one reply. The process gets blocked and the distributed kernel takes care of delivery of the message to the
members of the group (with datagram reliability). As soon as the first reply arrives, the process gets
unblocked with the reply message being returned from the Send. Further replies are buffered in the kernel.
If it wished to do so, the sender could stop receiving replies at this point. In this example, however, the sender
elects to look at further replies, and thus executes some number of GetReply calls. Finally, after it has seen
the third reply message, the process decides not to receive any further replies. If, as is the case in this
example, further replies come in, they will be discarded when the sender executes another Send.


6.4. Design Rationale

We first argue that some level of kernel support significantly enhances the ease and efficiency with which
one-to-many communication can be used.

6.4.1. Kernel Support

Superficially, one could claim that one-to-many communication can be implemented using multiple
one-to-one messages, and that no special kernel support is necessary. However, a number of problems arise with this
approach. First, one may not know the identity of all the desired receivers of a message if, for example, the
communication is analogous in intent to advertising. Second, when there is a large number of recipients, the
cost of sending separate messages is significant, especially relative to that possible when the underlying
hardware provides efficient broadcast or multicast. Finally, related to efficiency, some forms of one-to-many
communication do not need the full reliability associated with one-to-one communication. This unneeded
reliability, if inherited from the one-to-one communication primitives, imposes a further cost on
communication. Adequate support for one-to-many communication can relieve all of these problems. Next,
we survey potential uses of one-to-many communication in order to derive an appropriate level of kernel
support.

6.4.2.  Use  of  One-to-Many  Communication 

The two generic uses of one-to-many communication are notification and query. Notification consists of
communicating specified information to some group of processes. Query involves extracting specified
information from a group of processes or determining to some level of certainty that the desired information
is not available from this group. Query can be replaced by a notification facility using the idea of advertising.
For instance, a client wishing to locate a service may implement the query by notifying the group of server
processes of its interest in a particular service. It then waits to receive messages from members of this group
in response to its advertisement. An inverse approach used in the Port system [51] is for servers to advertise
their identity and services. A client simply waits to receive an advertisement for the service of interest to it.

In the V-System, both notification and query are implemented by sending a message (to many processes).
In the case of a notification, the replies to the message serve purely as acknowledgements. In the case of a
query, the replies carry reply data as well as serving as acknowledgements.

To a first approximation, notifications can be classified according to the number of reply messages
(acknowledgements of receipt) that are required. Similarly, queries can be classified according to the number
of responses that are expected to the query. We define an N-reliable notification (query) as a notification
(query) for which N replies are required. For notifications, N ranges from zero to the number of members in
the group. For queries, N ranges from one to the number of group members. All-reliable notification and
query (where responses are expected from all members of the group) require that the membership of the
group be known.

The classification in the above paragraph considers termination of a notification (query) depending on
receipt of a specified number of replies by the sender. Closer observation reveals that it is useful to be able to
terminate notifications (queries) when a more general termination condition is satisfied. Consider for instance
the case where we want to execute a program remotely, preferably on a lightly loaded processor. Each
machine runs a team server process, which maintains relevant information about the state of its machine:
whether it is available for remote execution, what its current processor load is, etc. All team servers belong by
default to the predefined team server group. In order to find a suitable host for remote execution, we send a
query to the team server group, indicating that we wish to receive many replies. Several strategies are then
possible. For instance, we could stop receiving replies when an idle processor has been found, or, failing that,
after a certain time interval or after a number of replies have been received, at which point we would pick the
processor with the lowest load. In this case, the termination condition is: either an idle processor has been
found, or a certain time interval has expired, or a certain number of replies have been received. Similarly, for
notification, a more general termination condition might be, for instance, that an acknowledgement from a
particular process has been received. Viewed from this angle, N-response query and notification are special
cases where the termination condition states that N replies must be received.
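
The team server example can be sketched as a sender-side loop that evaluates the termination condition
itself. This is a minimal sketch under stated assumptions: pick_host, its parameters and the representation
of a reply as a (process identifier, load) pair are hypothetical, and get_reply stands in for GetReply,
returning process identifier 0 when no reply is queued.

```python
import time


def pick_host(first_reply, get_reply, max_replies=8, timeout_s=0.05,
              clock=time.monotonic):
    """Collect team-server replies until the termination condition holds.

    first_reply: (pid, load) pair returned by the group Send.
    get_reply:   callable modeling GetReply; returns (0, None) when empty.
    Returns the pid of the chosen host.
    """
    replies = [first_reply]
    deadline = clock() + timeout_s
    while replies[-1][1] != 0:              # stop at once on an idle processor
        if len(replies) >= max_replies or clock() >= deadline:
            break                           # enough replies, or time is up
        pid, load = get_reply()
        if pid != 0:                        # pid 0 means nothing queued yet
            replies.append((pid, load))
    # Idle host if one was found, otherwise the least loaded host seen.
    return min(replies, key=lambda r: r[1])[0]
```

The loop polls for simplicity; the point is only that the condition lives in the sender, so it can be
changed without touching the kernel.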

We conclude from the above discussion that in general some facility is needed whereby a message is
delivered to a group of processes, and replies are returned until some termination condition is satisfied. We
have chosen to partition the implementation of this facility as follows: the kernel takes care of delivery of the
message to the group members and returns the replies from any number of them. It is then left to the sender
to evaluate the termination condition and to decide if further replies are required. This division of labor is
based on the observation that it is clearly undesirable to require the kernel to decide some arbitrarily complex
termination condition. In fact, one might even imagine cases where it would be nearly impossible for the
kernel to make such a decision, for instance when the termination condition depends on information the
process might have obtained via shared memory while replies are still outstanding. Via the multiple-response
option of group Send, the kernel supplies the sender the basic mechanism to be able to decide an arbitrarily
complex condition. The one-reply option could be viewed as a special case of this general facility, the
termination condition being that one reply is received. This special case is supported by the kernel for two
reasons:

1. It allows the simple case of a one-to-many Send with one reply to appear to the sender as completely
identical to a one-to-one Send. This is desirable for many simple applications.

2. The specification of a single reply allows an optimization in the kernel. Indeed, all replies after the first
one can be discarded immediately. Additionally, this optimization can be performed without any extra
mechanism in the kernel.

Another special case is all-reliable one-to-many Send. This option can be built on top of the available
kernel primitives (see Section 6.4.3 for a discussion of different implementation strategies). However, it is not
supported by the kernel itself because the benefits of such kernel support do not seem to warrant the added
complexity. The added complexity results from the need to know the membership of the group in order to
implement all-reply one-to-many Send. In general, the kernel on the sender's machine does not possess a
record of the membership of any group. Maintaining such knowledge is not required for many applications,
and would be cumbersome to implement for groups where processes can freely join, leave or be destroyed.
For the sender to supply a list of member processes to the kernel would require adding another primitive to
the kernel. This addition has been rejected particularly in view of the fact that a kernel implementation of
all-response Send would not result in better performance in terms of network or processor utilization, as is
shown next.

6.4.3.  Implementation  of  All-Replies  Send 

Several methods can be used to implement all-reliable one-to-many Send on top of the available kernel
primitives. First, a process can use repeated group Sends until it receives replies from all members of the
group. This method is not recommended because of its high cost in terms of the number of packet events. A
packet event is defined as the transmission or reception of a packet (see Section 5.3.2). The number of packet
events is indicative of the processor time spent in communicating. Given that processor usage is a critical
aspect of local network performance, a large number of packet events may degrade performance significantly.
With N members in a group not including the sender (and all members on different machines), the above
implementation requires one multicast packet and N point-to-point acknowledgement packets per
transmission. The cost in packet events is 3N + 1 without errors and an extra 3N + 1 for each error. This
contrasts with 4(N-1) packet events for N-1 point-to-point errorless messages and 4 packet events for each
additional error. Thus, the multicast scheme is more expensive in packet events than the point-to-point
simulation of multicast if there are errors and N is greater than 1.

Note that one might expect to lose packets in reply to a multicast packet because each of many reply packets
could arrive simultaneously (back-to-back) at the sending host, causing it to fall behind and drop some of the
packets. The solutions to this proposed by Mockapetris [55] are to use sophisticated filter network interfaces
to offload the processor and to use two multicast addresses for one group, alternating between them on each
packet so that retransmissions only appear on the previous address and are not seen twice. The former are not
available and the latter only works if there is only one sending process, a restriction that is unworkable in our
applications.

A more appropriate strategy is to Send the request to the group by one-to-many communication, collect the
replies, and then retransmit the message one-to-one to those group members that did not reply in time to the
original Send. This avoids an excessively high cost in packet events. For instance, if two processes failed to
respond, it would cost 3N + 9 packet events (assuming no errors on the one-to-one communication) versus
6N + 2 for a simple retransmission. As alluded to previously, note that the same cost is incurred regardless of
whether all-reliable one-to-many Send is supported by the kernel or implemented by the sender on top of
multiple-response one-to-many Send, as provided by the kernel.
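
The packet event counts quoted above follow from simple arithmetic, which the following Python sketch
reproduces. The function names are illustrative; the constants come from the text: one multicast round
costs 3N + 1 packet events and one errorless one-to-one exchange costs 4.

```python
def multicast_round(n):
    """One group Send answered by N members: 1 multicast transmission,
    N receptions, N reply transmissions and N reply receptions."""
    return 3 * n + 1


def repeat_multicast(n, retransmissions):
    """Repeated group Sends: every retransmission costs a full round."""
    return multicast_round(n) * (1 + retransmissions)


def multicast_then_unicast(n, missing):
    """One group Send, then one errorless one-to-one exchange
    (4 packet events) for each member that did not reply in time."""
    return multicast_round(n) + 4 * missing


if __name__ == "__main__":
    n = 10
    print(multicast_then_unicast(n, 2))  # 3N + 9 with N = 10
    print(repeat_multicast(n, 1))        # 6N + 2 with N = 10
```

For N = 10 and two silent members, the combined strategy costs 39 packet events against 62 for a second
multicast round, and the gap widens as N grows.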

The above method uses standard positive acknowledgement and retransmission for reliable delivery. This
guarantees that after a number of retransmissions the sender knows that the message is delivered to all
operational processes in the group, and that the other members are currently inaccessible. An alternative
strategy provides a somewhat different form of reliability: it guarantees that every message will eventually be
received by each process in the group (assuming the process expresses an interest in receiving the message).
This scheme, analogous to publishing used in real world one-to-many communication, places a greater
burden on the receivers. Basically, information to a group, the subscribers, is filtered through the publisher,
which collates and numbers the information before issuing it to the subscribers. Just as with a real world
publication, it is the subscribers that implement the reliable reception (to the degree they require), not the
publisher implementing reliable delivery. In particular, a subscriber notices that an issue has not been
received by a gap in the issue numbers of issues received or because a new issue was not received in the
expected time interval, in the case of regular publications. On detecting a missing issue, the subscriber may
request the missing back issue from the publisher. Issue numbers are also used to detect receiving duplicate
issues. Clearly, the publication model can be implemented using a combination of a reliable one-to-one Send
to the publisher, unreliable one-to-many Send from the publisher to the subscribers, and reliable one-to-one
Send from the subscribers to the publisher to obtain back issues.
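
The subscriber's side of the publishing scheme can be illustrated as follows. This Python sketch is a
hypothetical rendering of the idea, not code from the V-System: it assumes issues are numbered
consecutively from 1, and request_back_issue stands in for a reliable one-to-one Send to the publisher.
Detection of a missing issue by timeout is not modeled.

```python
class Subscriber:
    """Toy subscriber that implements reliable reception from issue numbers."""

    def __init__(self, request_back_issue):
        # Stand-in for a reliable one-to-one Send to the publisher.
        self.request_back_issue = request_back_issue
        self.next_expected = 1              # assumes numbering starts at 1
        self.issues = {}

    def on_issue(self, number, content):
        """Handle an issue delivered by the unreliable one-to-many Send."""
        if number < self.next_expected:
            return                          # duplicate issue: ignore it
        # A gap in the numbering reveals missed issues; fetch each one.
        for missing in range(self.next_expected, number):
            self.issues[missing] = self.request_back_issue(missing)
        self.issues[number] = content
        self.next_expected = number + 1
```

The publisher keeps no per-subscriber state here; each subscriber pays only for the reliability it wants.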

A slight variant on the above uses a logging process instead of a publisher. Each group has a logging
process. When a notification is sent to the group, by arrangement only the logging process responds. The
message is retransmitted until the logging process responds, thus providing reliable transport to the log. A
group member that is concerned about a missed group message can consult the logging process to extract the
message. This provides more immediate communication with the group (avoiding the delay of first sending
to the publisher on each message) but makes it difficult to impose a strict ordering on messages received from
different senders.

A further variant on this scheme by Chang and Maxemchuk [12] is described in Section 6.7.

6.4.4.  Groups 

The definition of group is open in a number of ways. First, processes from outside the group may send to
the group using the group identifier. This is required to allow a client process to send to a group of servers, a
common use in our experience. A group model whereby only group members are allowed to send to the
group can be derived from the basic model by having member processes discard all messages but those
believed to be originated within the group.

Second, any process can join or leave a group. Although this is fine as a basic model, without any additional
mechanism it poses a serious protection problem. For instance, a process might join a group and disrupt its
proper operation by sending bogus replies to group messages. We recognize that control is needed over group
membership and plan to implement a means of refusing membership based either on user identification or
on objection from existing members of the group.


6.5.  Implementation 

We describe an implementation for the 3 Mb Ethernet and the 10 Mb Ethernet. The implementation deals
with the communication aspect of multicast, and the allocation and management of multicast addresses and
group identifiers.


6.5.1.  Allocation  of  Group  Identifiers  and  Multicast  Addresses 

When an AllocateGroupId() is executed, a unique group identifier is allocated and a multicast address is
associated with it. The allocation of group identifiers and their mapping to multicast addresses is subject to
the following two requirements:

1. The group identifier name space and the process identifier name space need to be strictly disjoint, within
the confines of the name space offered by a 32-bit identifier.

2. The multicast address needs to be derivable from the group identifier. When a JoinGroup(groupId, pid)
is executed, the kernel has to instruct the Ethernet interface to accept packets for the corresponding
multicast host address [footnote 29]. In other words, one cannot rely on inferring the mapping from incoming
packets as for point-to-point communication (see Section 5.4.2), since the host address needs to be
known before any messages addressed to the group can be received.

Additionally, the following properties are desirable:

1. The test whether an identifier is a process identifier or a group identifier should be simple and efficient.

2. As for process identifiers, we would like to allow for distributed generation of group identifiers while
still guaranteeing their uniqueness.

3. Finally, we would like to achieve the situation where a multicast packet resulting from a group Send
does not cause any processor utilization on processors which have no members in the group. This
requires a one-to-one mapping from groups to multicast addresses.

In practice, group identifiers are allocated as follows. The most significant bit of a group identifier is always
on, while the most significant bit of a process identifier is always off. This reduces the test whether a 32-bit
identifier is a group identifier or a process identifier to a simple bit test. This multicast bit reduces the
available name space for group identifiers to 31 bits, allowing for 2^31 unique identifiers. Within a name space
of that size, an acceptable degree of uniqueness can be achieved by simply letting each host pick a random
number when it needs to allocate a group identifier. The probability of a collision is approximately m^2/n,
where m is the number of groups and n is the size of the name space (see Section 5.3.2). For reasonable
values of m, this probability is quite small. For instance, for m = 32, the probability of a collision is 0.0001%.
If deemed necessary, this probability could be lowered even further using a collision detection technique
similar to the one used for detecting collisions of logical host identifiers.
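
The allocation rule and the bit test can be written down directly. The Python sketch below is illustrative:
the names GROUP_BIT, is_group_id and allocate_group_id are assumptions made for the example, and no
collision detection is attempted.

```python
import random

GROUP_BIT = 1 << 31  # most significant bit of a 32-bit identifier


def is_group_id(identifier):
    """A single bit test separates group identifiers from process identifiers."""
    return (identifier & GROUP_BIT) != 0


def allocate_group_id(rng=random):
    """Pick the low 31 bits at random and force the group bit on."""
    return GROUP_BIT | rng.getrandbits(31)
```

Because each host draws its identifiers independently, no coordination between kernels is needed at
allocation time.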

The allocation of multicast addresses is evidently dependent on the particular network used. In the 10 Mb
Ethernet specification [27], large blocks of multicast addresses can be assigned to particular vendors. Thus,
the ideal of having a separate multicast address for each group could be achieved [footnote 30]. Pending the
assignment of such a block, we have taken the liberty of using a (static) set of multicast addresses. N bits of
the group identifier indicate the corresponding multicast address, through a table lookup. On our 3 Mb
Ethernet, we allocate a limited number of host addresses (8) that are currently unused, and proceed as with
the 10 Mb Ethernet [footnote 31].
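
The table lookup can be sketched as follows. Everything specific in this Python fragment is an assumption
made for illustration: the text does not say which N bits are used (the low-order bits are taken here),
N = 3 is chosen only to match a table of 8 entries, and the table holds placeholder strings rather than
real network addresses.

```python
# Placeholder entries; the real addresses depend on the network in use.
MULTICAST_TABLE = ["mcast-%d" % i for i in range(8)]
INDEX_BITS = 3  # assumed value of N: 2**3 == 8 table entries


def multicast_address(group_id):
    """Map a group identifier to a multicast address by table lookup.

    Using the low-order bits is an assumption; the text only says 'N bits'.
    """
    return MULTICAST_TABLE[group_id & ((1 << INDEX_BITS) - 1)]
```

With a small static table, distinct groups can map to the same address, so a receiving kernel still has
to check the group identifier carried in the packet against its local group membership.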


6.5.2.  Joining  and  Leaving  a  Group 

When a JoinGroup(groupId, processId) is executed, the kernel of the machine on which the process with
process identifier processId resides allocates a group descriptor, containing the mapping from group identifier
to process identifier. This mapping only needs to be maintained on that particular machine. If the
JoinGroup(groupId, processId) is executed on a machine different from the one on which the process with
process identifier processId resides, the request is forwarded between the kernel on which the request was
issued and processId's kernel by the interkernel protocol. Leaving the group results in the group descriptor
being deleted.

Footnote 29: By default, the Ethernet interface is only open for broadcast and for the local network address.

Footnote 30: Currently available interfaces severely limit the number of multicast addresses that can be received.

Footnote 31: Our 3 Mb network interface can be opened for any subset of the 256 possible 3 Mb network addresses.

6.5.3. Sending to a Group

When a Send is done to a group, the message packet is sent to the associated multicast address derived from
the group identifier. The simple check for a group identifier is important in efficiently detecting and handling
this case. In particular, it allows one-to-many communication to be implemented with absolutely minimal
overhead for one-to-one communication (one extra bit test). If no reply is expected, the process is not
blocked. Otherwise, the process is blocked as for a normal Send (with the possibility of retransmissions, etc.)
until the first reply comes in, which is returned by the Send.

In the case of multiple replies, the second and subsequent replies are buffered in the kernel and queued for
reception by the sender using GetReply. The next invocation of Send discards all queued replies [footnote 32].
For each process, we maintain a sequence number, which is incremented on every Send. This allows us to discard
delayed replies to earlier message transactions, and ensures that only replies to the current message
transaction are queued in the kernel and eventually returned to the sender.
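
The role of the per-process sequence number can be shown with a small model. In this Python sketch the
class and method names are hypothetical, and it is assumed (the text does not say so explicitly) that
each reply carries the sequence number of the Send it answers.

```python
class ReplyFilter:
    """Toy model of sequence-number filtering for one sending process."""

    def __init__(self):
        self.sequence = 0
        self.queued = []

    def send(self):
        # Each Send starts a new transaction and discards queued replies.
        self.sequence += 1
        self.queued.clear()
        return self.sequence                # assumed to be echoed in replies

    def on_reply(self, sequence, reply):
        """Queue a reply only if it belongs to the current transaction."""
        if sequence != self.sequence:
            return False                    # delayed reply to an earlier Send
        self.queued.append(reply)
        return True
```

A reply that was delayed in the network past the next Send is thus dropped on arrival instead of being
mistaken for an answer to the new request.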


6.5.4.  Receiving  a  Message  from  a  Group  Send 

When the kernel receives a multicast packet, it checks the group descriptors to determine which local
processes belong to this group. This check is done quickly by using a hash table. A copy of the message is
linked into the message queue of each of the member processes (identically as for one-to-one messages). This
mechanism also works for group members on the same machine as the originator of the message, so no extra
code is required for local members of a group [footnote 33].
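
A per-machine view of group descriptors and message delivery might look like the following Python sketch.
It is a toy model under stated assumptions: the class Kernel and its method names are invented for the
example, a dictionary plays the part of the hash table, and forwarding of JoinGroup requests between
kernels is not modeled.

```python
from collections import defaultdict, deque


class Kernel:
    """Toy per-machine kernel state; names are illustrative only."""

    def __init__(self):
        self.group_members = defaultdict(set)      # group id -> local member pids
        self.message_queues = defaultdict(deque)   # pid -> queued (sender, message)

    def join_group(self, group_id, pid):
        self.group_members[group_id].add(pid)

    def leave_group(self, group_id, pid):
        self.group_members[group_id].discard(pid)
        if not self.group_members[group_id]:
            del self.group_members[group_id]       # descriptor is deleted

    def on_multicast(self, group_id, sender_pid, message):
        """Queue a copy for each local member; return how many were queued."""
        members = self.group_members.get(group_id, ())
        for pid in members:
            self.message_queues[pid].append((sender_pid, message))
        return len(members)
```

A packet for a group with no local members costs one lookup and is then dropped, which is the case a
one-to-one mapping from groups to multicast addresses would avoid entirely.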

6.5.5.  Replies  to  a  Group  Send 

If a group message has the no-reply bit set, replies to this message are short-circuited on the machine from
where the Reply is executed and cause no network traffic. For one-reply messages, replies are always
transmitted but only the first one to arrive at the sender has effect. For multiple replies, the kernel attempts
to receive and buffer all reply messages until the end of the associated message transaction, subject to
available buffer space.

6.5.6.  Performance 

Ideally, we would like to achieve the situation where delivery of a message to N processes, using
one-to-many Send is N times faster than using repeated one-to-one Sends. Additionally, the cost should be
comparable to using hardware multicast for the same purpose. In our current implementation, a 0-reply
group Send takes milliseconds while a 1-reply group Send takes basically the same time as a one-to-one
Send, namely 2.2 milliseconds. The time for a multiple-reply group Send should be the same to get the first
reply and then roughly 0.2 millisecond per message to get the other replies assuming they have arrived when
GetReply is called.
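
As a worked example of the last estimate, the following sketch computes the time to collect k replies
under the stated assumption that all replies have already arrived when GetReply is called. The function
name and constant names are introduced here for illustration; the two figures are the ones quoted above.

```python
ONE_TO_ONE_SEND_MS = 2.2  # from the text: time of a 1-reply group Send
EXTRA_REPLY_MS = 0.2      # from the text: rough cost of each further GetReply


def multi_reply_time_ms(replies):
    """Estimated time to collect `replies` replies that have already arrived."""
    if replies < 1:
        raise ValueError("at least one reply is needed to complete the Send")
    return ONE_TO_ONE_SEND_MS + EXTRA_REPLY_MS * (replies - 1)
```

For five replies this gives roughly 2.2 + 4 x 0.2 = 3.0 milliseconds.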


6.6.  Applications 

Conventional applications of broadcast and multicast can be built nicely on top of our group
communication mechanism, thereby removing all network-related code from these applications, as discussed
below. We also describe the use of one-to-many communication in highly parallel distributed computations.

6.6.1.  Amaze:  A  Multi-Player,  Multi-Machine  Game

Amaze [8] is a multi-player game program that runs on a set of personal workstations connected by a local
network. Each player maneuvers his representative "monster" through a maze with the objective of shooting
the monsters controlled by other players without allowing his own monster to be shot. Each player has a
personal workstation running a copy of Amaze, providing local display of the maze and monsters, and
communication with other player workstations through a local network. Only the workstations of
participating players can be relied on to maintain the state of the game and players are free to join and quit at
arbitrary times. Thus, there cannot be a single site for the global game state, which must therefore be
replicated. This problem is quite typical of a number of distributed programming applications and therefore
of general interest.


^^A timer with a suitably large timeout interval takes care of the case when no further Sends are done.

^^This assumes that the network interface is capable of receiving its own packets. Not all interfaces have this desirable property.

Our previous experience with distributed games stems from using Xerox Alto computers which run
multi-player game programs such as Maze Wars. Maze Wars, like other games run on the Altos, uses Ethernet
multicast to communicate game state updates to the workstations running the game. A state update is sent to
a particular multicast address for the game. All players listen for updates on that multicast address. Amaze is
an attempt to build a more network-independent multi-machine game.

The original implementation of Amaze relies entirely on the one-to-one communication primitives of the V
kernel. A game manager process maintains the local copy of the game state on each workstation and also
ensures that the workstation display accurately reflects the current game state. The game manager performs
state updates in response to messages from its helper processes: a timer process, a keyboard reader and one
status inquirer process per remote player. A remote state update is reported to the game manager by a status
inquirer that is dedicated to reading and reporting the state of a particular monster running on another
machine. The inquirer sends a message to the remote monster's manager requesting a status update and then
pauses to await the event. The remote game manager only replies when a change of state occurs locally that
has not previously been reported^. Having obtained the update, the inquirer sends a message reporting it to
its own manager.

Using the group mechanism, no status inquirers are necessary at all. The game manager simply joins the
group and listens to updates from other players. It sends state updates effected locally to the group, using
no-reply Sends. Additionally, each game manager periodically sends its state to the group (regardless of
whether it has a state update to report) to indicate that its workstation is still participating in the game.

The implementation using the group mechanism compares favorably with the original point-to-point
implementation on several counts:

1.  In terms of cost, the group implementation is better both in network bandwidth and in the overall
number of packet events.

For a game with n participants, a state update requires 1 network packet, compared to 2(n-1) using
point-to-point connections. The number of packet events is n+1 per state update, compared to 4(n-1)
before. The "steady-state" traffic (when there are no state updates) accounts for n packets per timeout
interval when using the group mechanism compared to 2n(n-1) originally. The number of packet events
due to steady-state traffic is n^2 as opposed to 4n(n-1).

On  the  down  side,  the  application  must  itself  ensure  periodic  transmission  during  quiet  time  intervals. 
For  one-to-one  communication,  the  kernel  takes  care  of  such  chores. 

2.  Fewer processes are necessary: 3 when using the group mechanism (timer, keyboard and game
manager) compared to n+2 when using point-to-point communication (n-1 additional status inquirers).

3.  The interval between the periodic state updates can be freely set by the application and is not bound by
the retransmission interval of the kernel. The fact that the latter interval is fairly long (2.5 sec) was a
source of annoying temporary inconsistencies in the original implementation.
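
To make the comparison concrete, the counts quoted in items 1 and 2 can be tabulated for a given number
of players. The function below is only a sketch: its name and the dictionary keys are chosen here for
illustration, the formulas are copied from the text rather than derived, and the steady-state event
count of the group implementation is read as n squared.

```python
def amaze_costs(n):
    """Costs for n players, using the formulas quoted in the comparison above."""
    if n < 2:
        raise ValueError("the comparison needs at least two players")
    return {
        "group": {
            "update_packets": 1,
            "update_events": n + 1,
            "steady_packets": n,        # per timeout interval
            "steady_events": n * n,     # read as n squared
            "processes": 3,
        },
        "point_to_point": {
            "update_packets": 2 * (n - 1),
            "update_events": 4 * (n - 1),
            "steady_packets": 2 * n * (n - 1),
            "steady_events": 4 * n * (n - 1),
            "processes": n + 2,
        },
    }
```

With four players, for instance, steady-state traffic drops from 24 packets per interval to 4.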

Note that while using multicast, the new Amaze program does not contain any network-specific code (as the
Alto games do) and therefore remains relatively portable to other network technologies. In fact, the game has
been ported from the 3 Mb Ethernet to the 10 Mb Ethernet without any modification.

^Although no process-level message traffic is present during these "quiet" periods, the sender's kernel periodically polls the replier's
kernel in order to distinguish a "quiet" state from the case where the replier is dead. This causes 2 network packets to be transmitted
during each polling interval (2.5 seconds).




The game program is an example of an application of one-to-many communication that falls under the
general category of unreliable notification. It is not important that all notifications arrive at all destinations,
since subsequent updates subsume previous ones. (Each update carries absolute state information.) A number
of practical systems exhibit this property. However, some applications require that all notifications reach all
accessible destinations, such as for distributed database updates. These applications would use the reliable
notification techniques discussed earlier.

6.6.2.  Locating  Servers 

The previous example fits under the notification paradigm. The other major use of one-to-many
communication is for queries.

Locating a service in the V-System involves finding out the process identifier of a process implementing the
service. All processes implementing a particular service form a group representing that service. This group is
identified by one of the predefined group identifiers (see Section 6.3). When a server starts up, it executes a
JoinGroup(serverGroupId, myPid). Clients trying to locate a service then execute a Send(msg,
serverGroupId), the requestcode of the message being Query, one of the standard system requestcodes. By
convention, all servers know how to process such a Query request and respond to it appropriately.

The original V kernel had special kernel operations for the same purpose, SetPid and GetPid. For a
definition of these primitives, see Section 3.2; for a description of their implementation, see Section 5.6.3. A
number of problems were identified with this approach. First, special kernel support was present although it
did not seem inherently necessary. Instead, it was merely a result of the absence of one-to-many
communication as a generally available facility at the kernel interface. These extra primitives not only
required extra code and data structures in the kernel, but also special packet types in the interkernel protocol,
and associated packet handlers. Second, the approach was prone to annoying inconsistencies. Once a process
was registered, it remained registered until its registration was overridden by another process. So, the kernel
would occasionally return an invalid process identifier. The situation was somewhat ameliorated by having
the kernel check the validity of the process identifiers it returned. However, this precluded registering remote
processes or groups (because the kernel itself cannot check validity of remote processes). The approach was
also found to be unnecessarily inflexible. Only a single modifier could be added to either a server registration
or a client location request. Besides, it was an all-or-nothing approach: a server was either registered or not; it
could not selectively respond or refuse to respond to certain requests. Finally, the approach was somewhat
vulnerable to bogus processes registering themselves.

Using a one-to-many query to the servers themselves has many advantages over the original mechanism.
First, no kernel support is necessary beyond what is already available. Second, there is no danger of
permanent inconsistency since the servers themselves respond to queries. Third, the mechanism allows the
servers great flexibility in responding to lookups. For instance, a server can respond only to authorized users,
when its load is not too high, etc.

With respect to one-to-many communication, it allows us to demonstrate an example of both one-reply and
multiple-reply query. Simple applications are typically interested in locating a generic service, without
worrying much about which particular instance of that service they obtain as the result of their query. The
one-reply option of query is adequate for their purposes. More sophisticated applications might elect to
receive a number of replies. The application involving remote execution discussed in Section 6.4 is a good
example. Additionally, the multiple-reply mode of operation alleviates somewhat the problem of a bogus
process registering itself as a certain kind of server. Such a process can still join a group and respond to Query
requests, but since many replies can be received to such a Query, the client at least gets a choice of server.
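
The lookup scheme of this section can be sketched as a small in-process simulation. Nothing below is the
V-System interface: the classes Server and Groups, the willing flag and the max_replies parameter are
hypothetical stand-ins for JoinGroup, a group Send with the Query requestcode, and the one-reply and
multiple-reply options.

```python
QUERY = "Query"  # stand-in for the standard system requestcode


class Server:
    def __init__(self, pid, willing=True):
        self.pid = pid
        self.willing = willing  # a server may decline to answer a lookup

    def handle(self, msg):
        if msg.get("requestcode") == QUERY and self.willing:
            return {"pid": self.pid}
        return None


class Groups:
    """Toy in-process stand-in for kernel group membership."""

    def __init__(self):
        self.members = {}

    def join_group(self, group_id, server):
        self.members.setdefault(group_id, []).append(server)

    def send(self, msg, group_id, max_replies=1):
        # Collect replies from group members until max_replies have been received.
        replies = []
        for server in self.members.get(group_id, []):
            reply = server.handle(msg)
            if reply is not None:
                replies.append(reply)
            if len(replies) == max_replies:
                break
        return replies
```

A simple client calls send with the default max_replies=1 and uses whichever server answers first; a
more selective client asks for several replies and chooses among them.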

6.6.3.  Distributed  Computation 

A class of applications of special interest to us involves large-scale, highly parallel distributed computations.
Here we envision many processes running on separate processors (possibly provided by separate workstations)
working in parallel to solve a CPU-intensive problem. One-to-many communication is used to query the
progress of other processes as well as notify other processes of new insights discovered. In general, we
structure the sharing of the growing program knowledge base in a fashion similar to the way in which
researchers share their knowledge. Results and queries may go one-to-one to closely cooperating colleagues,
as with "private communications". Important results are submitted for publication to a publisher process.
The publisher then decides whether or not to distribute that information. More importantly, the publisher
takes on the job of distribution and maintenance of back issues, freeing the researcher to continue its work.

As an example of a distributed computation using one-to-many communication, consider a parallel α-β
search (see Figure 6-2). Several searcher processes explore different branches of the game tree with some
initial search window. All these processes belong to a single group, together with a manager process. As the
individual processes evaluate their subtrees, they obtain narrower bounds on the overall search window.
These narrower bounds are of interest to all searchers since they might prune parts of other subtrees as well.
So any new bounds are announced to the group by one-reliable notification. By arrangement, only the
publisher ever responds to a message sent to the group, thereby providing one-reliable notification to the
publisher process.


Figure 6-2: α-β Search on a Collection of Workstations

Note that the one-reliable aspect of the notification is essential in order to ensure that the final result of the
computation is correct: it makes sure that the publisher sees all results, and deduces correctly the best move
for the given position. If one of the searcher processes misses a message, this does not affect the correctness of
the algorithm. It might cause this process to do some extra work but the final result will remain the same.
Similarly, a program to solve the traveling salesman problem could use simple notification to communicate
the current least cost path to all parallel searchers, reducing the exploration of inferior routes.
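
The reason a missed announcement is harmless for a searcher can be seen in a small sketch. The class
Searcher and its methods are assumptions made for this illustration, and can_prune is a deliberately
simplified pruning test rather than a full α-β implementation.

```python
class Searcher:
    """One searcher's view of the shared search window (illustrative only)."""

    def __init__(self, alpha=float("-inf"), beta=float("inf")):
        self.alpha = alpha  # lower bound of the search window
        self.beta = beta    # upper bound of the search window

    def on_bounds(self, alpha, beta):
        # Announced bounds only ever narrow the window. A missed announcement
        # leaves the window wider, which costs extra work but not correctness.
        self.alpha = max(self.alpha, alpha)
        self.beta = min(self.beta, beta)

    def can_prune(self, value_upper_bound):
        # Simplified test: a subtree that cannot exceed alpha need not be searched.
        return value_upper_bound <= self.alpha
```

A searcher that missed an announcement simply keeps exploring subtrees that a better-informed searcher
would have pruned.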

6.7.  Related  Work 

Since the subject of one-to-many communication is somewhat distinct from the subject matter of the rest of
this thesis, we include in this chapter a separate section describing related work on broadcast and multicast
communication.

One area of related work concerns the provision of broadcast and multicast at the data link and network
levels. In particular, the Ethernet [54] broadcast network requires only multicast addressing and filtering in
the hosts to provide multicast delivery. Implementing broadcast and multicast on point-to-point networks or
internetworks has been covered by a number of authors, including Dalal and Metcalfe [25], Wall [76] and
Boggs [10]. All this work addresses the efficient multicast delivery of packets to hosts, and thereby provides
the necessary underlying mechanism for efficient one-to-many interprocess delivery as addressed by the
V-System.

The use of the broadcast and multicast capabilities of local networks has been the subject of other related
work, Boggs' thesis [10] being the primary reference. Our work focuses on proposing an application-level
interface to one-to-many communication, a subject which to the best of our knowledge has not been
addressed in previous work.

Finally, Chang and Maxemchuk [12] describe a reliable broadcast protocol similar to our publication
model. They define reliable broadcast to a set of machines to guarantee that all machines receive the same
sequence of notification messages. Summarizing their implementation, each notification is broadcast to all
machines. A token host assigns a sequence number to the message (incremented by one for every message),
and then broadcasts an acknowledgement containing this sequence number. Again, as in the publication
model, the onus is on the receivers to request retransmissions from the token machine. The broadcast
acknowledgement seems difficult to formalize in the request-response paradigm of the V world. Additionally,
their work contains algorithms for transfer and generation of tokens, and correctness proofs of the protocols
involved.
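
The receiver's role in such a scheme can be illustrated by a gap-detection sketch. This is not Chang and
Maxemchuk's protocol: it is a simplification in which each message arrives together with its assigned
sequence number, sequence numbers are assumed to start at 1, and the class and method names are
hypothetical.

```python
class Receiver:
    """Delivers messages in sequence order and reports gaps (illustrative only)."""

    def __init__(self):
        self.next_seq = 1    # assumed starting value
        self.pending = {}    # out-of-order messages held back, seq -> message
        self.delivered = []

    def on_sequenced(self, seq, message):
        """Accept a sequenced message; return the sequence numbers to re-request."""
        if seq >= self.next_seq:
            self.pending[seq] = message
        # Deliver every message that is now contiguous with what was delivered.
        while self.next_seq in self.pending:
            self.delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        if not self.pending:
            return []
        # Anything below the highest held-back number and not yet seen is missing.
        return [s for s in range(self.next_seq, max(self.pending))
                if s not in self.pending]
```

Because every receiver delivers strictly in sequence-number order, all receivers that fill their gaps
see the same sequence of messages.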


6.8.  Chapter  Summary 

One-to-many communication is useful in distributed systems in the absence of global shared memory,
which provides this facility implicitly. The performance of such a facility is important for highly parallel
distributed computation. With this motivation, we have extended the V kernel to provide a simple group
Send form of one-to-many communication with the option of zero, one or multiple replies. This simple
mechanism supports the common forms of query and notification and can be used by the application to
implement fully reliable notification and query if required. We argued against implementing fully reliable
multicast directly in the kernel. Finally, we have described some initial applications.

Several questions remain unanswered at this point. First, as we already pointed out, a protection issue is
raised by the very open structure of process groups. We plan to implement a mechanism whereby
membership of a group can be refused based on lack of user authentication or based on objections by other
group members. Second, we have ignored the interference of one-to-many communication with segment
access as used for MoveTo and MoveFrom. Currently, we allow concurrent read access to a segment. If write
access is given to a segment, the first replier is given exclusive access to the segment. We do not have much
evidence to indicate whether this semantics is appropriate. Finally, we are interested in exploring the
potential for large-scale parallel computation of a system consisting of processors interconnected by a
broadcast network.


Conclusions


7.1.  Summary

In this thesis we have presented the design, the implementation and the evaluation of a transparent
message-based interprocess communication mechanism on a local area network.

Chapter 3 contains the definition of the interprocess communication primitives and measurements of their
performance on SUN workstations interconnected by a 3 or a 10 Mb Ethernet network. We introduced the
notion of network penalty as the minimum cost for transferring data between machines over a network, given
a particular combination of processor, network and network interface. Based on this notion, we estimated
lower bounds on the elapsed time for the intercommunication primitives when implemented between
machines in the given configuration. We showed that the elapsed time for a message exchange is only one
millisecond higher than the lower bound, and that the elapsed time for transferring one kilobyte of data is less
than one millisecond above the lower bound. Furthermore, we have compared the performance of file access
and pipes when implemented on top of the message passing primitives to estimates of the performance of
those applications when supported by a dedicated protocol.

The performance of network file access was further investigated in Chapter 4. The measurements in
Chapter 3 were done in an otherwise idle environment and in a particular hardware configuration. To assess
performance under load, and to investigate how performance would change between different configurations,
we built a queueing network model of file access from diskless workstations over a local network to a set of
file servers. Results from the model indicate that transparent file access is possible without significant
performance degradation, for moderate numbers of workstations. Under a plausible set of assumptions
(including those present in the baseline configuration), the file server CPU is shown to be the bottleneck. We
have explored a number of options whereby this bottleneck could be relieved. In particular, using large
client-server interaction sizes, introducing a file server cache and using a second file server offered good
potential for improving performance under high load.

The protocol underlying the message passing primitives was studied in Chapter 5. Efficiency was a primary
concern during the design of the protocol and its implementation. This emphasis is reflected in this chapter in
discussing methods for process identifier generation, process location, message passing and data transfer. Of
particular interest was the discussion of the efficiency of moving large amounts of data reliably across a local
area network. For the low error rates typical for local area networks, it was shown that the expected time of
such a transfer is almost identical to the elapsed time when no errors occur, regardless of the retransmission
strategy. While the retransmission strategy has little effect on the expected time for the transfer under the
assumption of low error rates, it has a dominant effect on the standard deviation of the elapsed time.

Finally, in Chapter 6, we proposed a mechanism for integrating one-to-many communication into
message-passing, drawing partially on the broadcast and multicast capabilities of local networks. We have
discussed the issues involved in addressing groups of processes and in providing (different degrees of)
reliability for one-to-many communication. In particular, we have argued against implementing fully reliable
one-to-many communication at the kernel level. Some initial applications of one-to-many communication
were explored as well.


7.2.  Future  Research

Several  issues  were  left  unanswered  by  this  thesis.  An  important  class  of  remaining  open  questions  is 
concerned  with  the  validity  of  the  results  in  different  environments.  We  briefly  survey  some  of  the  issues. 

7.2.1.  Internetworking

The current implementation of the V-System runs on a single local area network. No provisions are made in
the current implementation for operating in an internet environment. Encapsulating the V interkernel
packets in internet packets rather than Ethernet data link layer packets provides only the first step in this
direction. As indicated in Chapter 5, the process identifier allocation mechanism, and to a lesser degree, the
process location mechanism, rely on the capability of being able to broadcast packets under certain
circumstances. Most current internet protocols do not admit broadcasting, so either extensions to the internet
protocols must be explored or the V interkernel protocol must be adapted. Second, both the latency and the
error rate on an internet are higher than on a local area network. The robustness of the protocol in the face of
higher error rates and longer delays must be improved. Whether all of this can be done without unduly
affecting performance (on a local network) remains an open question.

7.2.2.  Network  Demand  Paging 

In Chapter 4 we have discussed at great length the performance of sequential file access across a local
network. Network demand paging poses a different set of challenges, in terms of workload as well as in terms
of the benefits of buffering, etc. Also, while the advantages of a shared file server are quite clear, the case for
a shared paging server seems less strong, since paging, unlike file access, has little notion of sharing. The
sharing aspects of a paging server thus seem to get in the way of performance without providing much benefit
in return.

During the design of the protocol, several performance-related decisions were made based on intuitive
notions about (avoiding) buffering. The cost of buffering changes drastically when demand-paged machines
are considered, where the copy operation, inherent in buffering, can potentially be subsumed by mapping and
unmapping pages. At the time of writing, the system has been ported to a 68010-based machine, capable of
demand paging, although the current implementation does not take advantage of this capability.

7.2.3.  Smart  Network  Interfaces 

Finally, more than once we lamented the inadequacies of currently available network hardware. Suggested
modifications ranged from double buffering to dedicated network interface processors. Double buffering
would potentially be able to provide data rates comparable to the network data rate. Intelligent scatter-gather
DMA interfaces are able to reduce processor utilization on transmission, but they seem to be of little
advantage on reception of packets. Programmable network interfaces have the potential for further reducing
the processor utilization for network interprocess communication. Careful hardware-software integration is
required to be able to take maximum advantage of these sophisticated interfaces.